Closed jkin2003 closed 4 years ago
Hi @jkin2003 ,
Could you please provide the command you are running to download only unmapped sequences?
You can test fetching unmapped reads using a file available from our test dataset. Instructions for accessing the test dataset can be found here ("Testing the pyEGA3 Download Client" section).
Using the test dataset, you should be able to get unmapped reads from the BAM file with ID EGAF00001753746 using the following command:
pyega3 -t fetch --reference-name '*' --format BAM --saveto EGAF00001753746-unmapped-slice.bam EGAF00001753746
Ms. Freeburg,
Thank you for your suggestion, and it makes perfect sense according to the documentation. Unfortunately the syntax you provided (and that I attempted) executes a download of the entire BAM file. Furthermore, any unrecognized syntax in quotations will also result in downloading the entire BAM file.
After many helpful emails, the EGA help desk consulted with their engineers, and this was the answer I received:
"Thank you for your patience as we worked to answer your query. Pyega3 support for htsget is still a work in progress and not all features are fully available. Unfortunately, htsget does not currently support the downloading of unmapped slices, you will need to download the bam file as a whole.
I hope this clarifies things, but please let me know if you have any other questions."
So at this time it seems the only option is to download the entire BAM file collection and extract the unmapped regions with samtools.
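For anyone else landing here, a minimal sketch of that workaround, assuming samtools is installed and the full BAM has already been downloaded. The function name `extract_unmapped` and the file names are hypothetical placeholders, not part of pyega3:

```shell
# Hypothetical helper (not part of pyega3): pull the unmapped reads out of a
# BAM file that has already been downloaded in full.
extract_unmapped() {
    in_bam="$1"    # path to the fully downloaded BAM (placeholder)
    out_bam="$2"   # path for the BAM containing only unmapped reads (placeholder)
    # -b   write BAM output
    # -f 4 keep only reads with the "read unmapped" flag (0x4) set
    # -o   write to the given file instead of stdout
    samtools view -b -f 4 -o "$out_bam" "$in_bam"
}
```

It would be called as `extract_unmapped downloaded.bam downloaded-unmapped.bam`. Filtering on flag 0x4 also keeps unmapped reads whose mate is mapped; using `-f 12` instead would restrict the output to pairs where both mates are unmapped.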
The current documentation outlines how to download mapped regions of a BAM, but does not address unmapped regions. While it does refer the reader to the htsget protocol, it seems from the email quoted above, as well as from my many repeated attempts, that this function is not available at this time. I do think it would be helpful to address the current limitations of this function in the documentation.
Hi @jkin2003,
I'm glad to hear that the EGA Helpdesk could provide you with more information. Indeed, as the documentation suggests, the current implementation only supports "The reference sequence name, for example 'chr1', '1', or 'chrX'. If unspecified, all data is returned." and does not yet support fetching unmapped reads.
I agree that it would be helpful to indicate that support for fetching unmapped reads is in progress. We have scheduled work to improve this in our internal ticketing system, so I will close this issue now. Thanks again for bringing this to our attention. Your feedback is very helpful and much appreciated :)
We'd be really interested in this functionality too. Any update on this?
Hi @danielsbrewer,
Thank you for expressing interest in this suggestion for improving the pyega3 download client. Adding new features to pyega3, like this one, is on our roadmap for 2021, so we hope to be able to make progress on this and other useful features this year.
-r, reference-name=, reference-name '*' all result in downloading the entire BAM file instead of a slice of only unmapped sequences.
How do I download slices of unmapped sequences?
Originally posted by @jkin2003 in https://github.com/EGA-archive/ega-download-client/issues/37#issuecomment-638491759