optimise response size when file searching

jethror1 commented 5 months ago

When searching DNAnexus for all BAMs in a project and returning their describe details here: https://github.com/eastgenomics/Genetics_Ark/blob/7a53b8fa0d374234170ede239e37269c23946f88/cron/find_dx_data.py#L124

we could limit the returned describe fields to only what is required. This should reduce the size of the response, lowering bandwidth load and making the query faster. The only fields we should need from the describe details are folder, name and archivalState.

Example of current response for single object:

>>> list(dxpy.find_data_objects(project='project-GVZFgv84bXV9yKJFYPpk232y', name="*bam", name_mode='glob', describe=True))[0]
{'project': 'project-GVZFgv84bXV9yKJFYPpk232y', 'id': 'file-GVZGFjj408bpQbxY959k9B6Y', 'describe': {'id': 'file-GVZGFjj408bpQbxY959k9B6Y', 'project': 'project-GVZFgv84bXV9yKJFYPpk232y', 'class': 'file', 'sponsored': False, 'name': '2303759-23102Z0008-1-BMT-FLM-MYE-M-EGG2_markdup.bam', 'types': [], 'state': 'closed', 'hidden': False, 'links': [], 'folder': '/output/MYE-230520_1224/sentieon-bwa-3.2.0', 'tags': [], 'created': 1684588003000, 'modified': 1710477353586, 'createdBy': {'user': 'user-bioinformaticsteamgeneticslab', 'job': 'job-GVZG8784bXVK79g71p3kFjBg', 'executable': 'app-G2JJ2j09Pxx8JFgbJfVb3QJQ'}, 'media': 'application/x-gzip', 'archivalState': 'archived', 'size': 857800803, 'cloudAccount': 'cloudaccount-dnanexus'}}

Limiting this to required fields:

>>> list(dxpy.find_data_objects(project='project-GVZFgv84bXV9yKJFYPpk232y', name="*bam", name_mode='glob', describe={'fields': {'folder': True, 'name': True, 'archivalState': True}}))[0]
{'project': 'project-GVZFgv84bXV9yKJFYPpk232y', 'id': 'file-GVZGFjj408bpQbxY959k9B6Y', 'describe': {'id': 'file-GVZGFjj408bpQbxY959k9B6Y', 'name': '2303759-23102Z0008-1-BMT-FLM-MYE-M-EGG2_markdup.bam', 'folder': '/output/MYE-230520_1224/sentieon-bwa-3.2.0', 'archivalState': 'archived'}}

Comparing the in-memory size of the responses (measured with pympler's asizeof) for a random 002 project of BAM files:

>>> asizeof.asizeof(list(dxpy.find_data_objects(project='project-GVZFgv84bXV9yKJFYPpk232y', name="*bam", name_mode='glob', describe=True)))
356440

>>> asizeof.asizeof(list(dxpy.find_data_objects(project='project-GVZFgv84bXV9yKJFYPpk232y', name="*bam", name_mode='glob', describe={'fields': {'folder': True, 'name': True, 'archivalState': True}})))
142112

>>> 142112 / 356440
0.39869823813264504

This gives a ~60% reduction in the size of the response (142112 vs 356440 bytes), and the same can be done for the query to find index files too.
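A minimal sketch of how the limited describe could be wrapped for reuse across the BAM and index queries. The names `DESCRIBE_FIELDS` and `find_bams`, and the injectable `find` argument, are hypothetical and not taken from find_dx_data.py; only the `dxpy.find_data_objects` arguments shown in the examples above are used.

```python
# Only the describe fields needed downstream (as listed above)
DESCRIBE_FIELDS = {
    "fields": {"folder": True, "name": True, "archivalState": True}
}


def find_bams(project_id, find=None):
    """Return all BAMs in a project with a reduced describe payload.

    `find` is a hypothetical hook so the search function can be swapped
    out (e.g. for testing); by default dxpy.find_data_objects is used.
    """
    if find is None:
        import dxpy  # imported lazily so the sketch loads without dxpy

        find = dxpy.find_data_objects

    return list(
        find(
            project=project_id,
            name="*bam",
            name_mode="glob",
            describe=DESCRIBE_FIELDS,
        )
    )
```

The same `DESCRIBE_FIELDS` dict could be passed to the index file query, so both searches return only folder, name and archivalState (plus the id that is always present).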

In addition, the call to find the CNV index currently happens once for every CNV BAM. Instead, we could search for all BAMs and all indexes once for the whole project, then match them up by name/path, so that only 2 API calls are made for the CNV BAMs per project: https://github.com/eastgenomics/Genetics_Ark/blob/7a53b8fa0d374234170ede239e37269c23946f88/cron/find_dx_data.py#L303-L310
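A rough sketch of the matching step, assuming both searches have already been run once per project and return objects shaped like the responses above. The function name `pair_bams_with_indexes` is hypothetical, and the assumption that an index sits in the same folder as its BAM and is named `<bam name>.bai` is mine; the actual naming used in find_dx_data.py may differ.

```python
def pair_bams_with_indexes(bams, indexes, index_suffix=".bai"):
    """Match each BAM to its index using folder and name from describe.

    Assumes (not confirmed by the source) that the index lives in the
    same folder as the BAM and is named <bam name> + index_suffix.
    """
    # Build a lookup of every index keyed by (folder, file name)
    index_by_path = {
        (idx["describe"]["folder"], idx["describe"]["name"]): idx
        for idx in indexes
    }

    paired, unmatched = [], []
    for bam in bams:
        details = bam["describe"]
        key = (details["folder"], details["name"] + index_suffix)
        index = index_by_path.get(key)
        if index is None:
            unmatched.append(bam)  # no index found alongside this BAM
        else:
            paired.append((bam, index))

    return paired, unmatched
```

Matching is then a dictionary lookup per BAM rather than an API call per BAM, and any BAMs without an index are returned separately so they can be logged or skipped.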

jethror1 commented 5 months ago

These searches could also all be run in parallel by project, using the following: https://github.com/eastgenomics/dias_reports_bulk_reanalysis/blob/0b7014926a94e3c607c3a5fe54a851a847adf901/bin/utils/utils.py#L10
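For illustration, a generic way to run one search per project concurrently with a thread pool. This is a sketch using the standard library, not a copy of the linked helper; the function name, signature and `max_workers` default here are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def call_in_parallel(func, items, max_workers=8):
    """Call func(item) for every item concurrently.

    Returns a dict mapping each item (e.g. a project ID) to its result.
    Any exception raised by func is re-raised when its result is read.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each submitted future back to the item it was created for
        futures = {executor.submit(func, item): item for item in items}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

Threads suit this case because each call spends most of its time waiting on the API response rather than doing local computation.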

corbin-chris commented 2 months ago

Closed by https://github.com/eastgenomics/Genetics_Ark/pull/78

optimise response size when file searching #62