Sage-Bionetworks / schematic

Package for biomedical data model and metadata ingress management
https://schematicpy.readthedocs.io/en/latest/cli_reference.html
MIT License

Expedite schematic tests #1028

Open linglp opened 1 year ago

linglp commented 1 year ago

Is your feature request related to a problem? Please describe.
Currently the schematic tests take a long time to run, and we have to wrap the submit-related tests in tenacity retries. We need a way to speed up our tests.

For context, here is the log of a test run that uses tenacity: https://github.com/Sage-Bionetworks/schematic/actions/runs/3498522060.

Without tenacity, we run into errors like this:

synapseclient.core.exceptions.SynapseHTTPError: 412 Client Error: 
Object: syn44262071 was updated since you last fetched it, retrieve it again and re-apply the update

The issue behind the error message above is tracked here: https://sagebionetworks.jira.com/browse/SYNPY-1239?atlOrigin=eyJpIjoiOTRjZTgzYWYyMTQ0NGQ1M2FiYjE2YTk0NjI2YTFkOGEiLCJwIjoiamlyYS1zbGFjay1pbnQifQ

See full log here: https://github.com/Sage-Bionetworks/schematic/actions/runs/3492110152/jobs/5845521281
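
As a rough illustration of the retry wrapping described above, here is a minimal sketch that uses only the standard library. It is a stand-in for the idea, not tenacity's API and not the project's actual test code. `StaleEntityError`, `retry_on`, and `submit_manifest` are hypothetical names; `StaleEntityError` plays the role of the 412 `SynapseHTTPError`.

```python
import functools
import time


class StaleEntityError(Exception):
    """Hypothetical stand-in for the 412 SynapseHTTPError shown above."""


def retry_on(exc_type, attempts=3, delay=0.0):
    """Retry the wrapped callable when it raises exc_type, up to `attempts` tries."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exc_type:
                    if attempt == attempts:
                        raise  # out of attempts: surface the original error
                    time.sleep(delay)
        return wrapper
    return decorator


calls = {"count": 0}


@retry_on(StaleEntityError, attempts=3)
def submit_manifest():
    # Hypothetical submit call: fails twice with a stale-entity error, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise StaleEntityError("entity was updated since you last fetched it")
    return "submitted"


result = submit_manifest()
```

Each retry repeats the slow call, which is one reason retried tests add to the total run time.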

How important is this feature? 🌗 Medium: can do work without it, but it's important (e.g. to save time or for convenience)

When will use cases depending on this become relevant? Mid-term: 2-4 months

linglp commented 1 year ago

Here's the run time of some of the slowest tests: (see full log here: https://github.com/Sage-Bionetworks/schematic/actions/runs/3518140105/jobs/5896721560#:~:text=html%3Ahtmlcov%20%2D%2Dcov%3Dschematic-,/,-api%20%2D%2Ddurations%3D0)

============================== slowest durations ===============================
61.16s call     tests/test_api.py::TestManifestOperation::test_generate_existing_manifest[Patient-excel]
51.94s call     tests/test_api.py::TestManifestOperation::test_generate_existing_manifest[data_type3-excel]
50.50s call     tests/test_api.py::TestManifestOperation::test_generate_existing_manifest[Biospecimen-excel]
48.48s call     tests/test_api.py::TestSynapseStorage::test_get_dataset_files[None-False]
47.08s call     tests/test_api.py::TestManifestOperation::test_generate_existing_manifest[data_type3-dataframe (only if getting existing manifests)]
46.64s call     tests/test_api.py::TestManifestOperation::test_generate_existing_manifest[data_type3-google_sheet]
46.10s call     tests/test_api.py::TestManifestOperation::test_generate_existing_manifest[data_type3-None]
43.75s call     tests/test_api.py::TestSynapseStorage::test_get_dataset_files[None-True]
41.45s call     tests/test_api.py::TestSynapseStorage::test_get_dataset_files[Sample_A.txt-False]
39.40s call     tests/test_api.py::TestSynapseStorage::test_get_dataset_files[Sample_A.txt-True]
38.07s call     tests/test_api.py::TestManifestOperation::test_submit_manifest[[*** "Patient ID": 123, "Sex": "Female", "Year of Birth": "", "Diagnosis": "Healthy", "Component": "Patient", "Cancer Type": "Breast", "Family History": "Breast, Lung", ***]]
34.90s call     tests/test_api.py::TestManifestOperation::test_submit_manifest[None]

Maybe try mocking these parts.
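
A minimal sketch of what mocking could look like, using `unittest.mock` from the standard library. The function `count_dataset_files`, the `storage` object, its `get_dataset_files` method, and the Synapse IDs are all hypothetical and chosen only for illustration; they are not taken from schematic's code.

```python
from unittest.mock import Mock


def count_dataset_files(storage, dataset_id):
    """Hypothetical code under test: counts files reported by a storage client."""
    return len(storage.get_dataset_files(dataset_id))


# The mock replaces the network-backed client, so no request reaches Synapse.
fake_storage = Mock()
fake_storage.get_dataset_files.return_value = [
    ("syn0000001", "Sample_A.txt"),  # made-up IDs for illustration
    ("syn0000002", "Sample_B.txt"),
]

file_count = count_dataset_files(fake_storage, "syn0000000")

# Verify the code under test asked the client for the expected dataset.
fake_storage.get_dataset_files.assert_called_once_with("syn0000000")
```

Tests written this way check the logic around the storage call without waiting on the remote service, at the cost of no longer exercising the real integration.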

linglp commented 1 year ago

After changing to submit to different synapse folders (based on versions of Python), here's the run time:

39.49s call     tests/test_api.py::TestManifestOperation::test_submit_manifest[[*** "Patient ID": 123, "Sex": "Female", "Year of Birth": "", "Diagnosis": "Healthy", "Component": "Patient", "Cancer Type": "Breast", "Family History": "Breast, Lung", ***]]
35.79s call     tests/test_api.py::TestManifestOperation::test_submit_manifest[None]
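
One way to pick a per-version folder, sketched under assumptions: the folder IDs below are placeholders and `target_folder` is a hypothetical helper, not the change that was actually made in the test suite.

```python
import sys

# Hypothetical mapping from Python (major, minor) version to a dedicated folder ID.
FOLDER_BY_PYTHON = {
    (3, 9): "syn_folder_py39",
    (3, 10): "syn_folder_py310",
}


def target_folder(version_info=sys.version_info, default="syn_folder_default"):
    """Pick the submission folder for the interpreter running the tests."""
    key = (version_info[0], version_info[1])
    return FOLDER_BY_PYTHON.get(key, default)
```

Calling `target_folder()` with no arguments uses the running interpreter's version, so each job in a CI matrix would write to its own folder.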

linglp commented 1 year ago

Some of the ideas brought up by @BrunoGrandePhD in discussion that I think are worth documenting:

linglp commented 1 year ago

Bruno presented the idea of using PyFilesystem as the interface to the asset store. This is related to the idea of treating Synapse as one of several asset stores and using PyFilesystem to implement a common interface that tests can target.
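
To make the common-interface idea concrete, here is a small sketch using only the standard library. It does not use PyFilesystem's API; `AssetStore`, `InMemoryAssetStore`, and `upload_manifest` are hypothetical names that show how code written against an interface can be tested with an in-memory store instead of Synapse.

```python
from typing import Dict, List, Protocol


class AssetStore(Protocol):
    """Hypothetical common interface; not PyFilesystem's actual API."""

    def write_bytes(self, path: str, data: bytes) -> None: ...

    def read_bytes(self, path: str) -> bytes: ...

    def list_paths(self, prefix: str) -> List[str]: ...


class InMemoryAssetStore:
    """Test double that keeps files in a dict instead of a remote store."""

    def __init__(self) -> None:
        self._files: Dict[str, bytes] = {}

    def write_bytes(self, path: str, data: bytes) -> None:
        self._files[path] = data

    def read_bytes(self, path: str) -> bytes:
        return self._files[path]

    def list_paths(self, prefix: str) -> List[str]:
        return sorted(p for p in self._files if p.startswith(prefix))


def upload_manifest(store: AssetStore, dataset: str, csv_text: str) -> str:
    """Hypothetical code under test: depends only on the interface."""
    path = f"{dataset}/manifest.csv"
    store.write_bytes(path, csv_text.encode("utf-8"))
    return path


store = InMemoryAssetStore()
manifest_path = upload_manifest(store, "dataset_a", "Patient ID,Sex\n123,Female\n")
```

A Synapse-backed implementation of the same interface could then be covered by a much smaller set of slow integration tests.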