EGA-archive / ega-download-client

A Python-based EGA download client
Apache License 2.0

incomplete read extraction within a genomic range #108

Closed jungch closed 3 years ago

jungch commented 3 years ago

incomplete read extraction within a genomic range (Error with pyega3 fetch)

Description of the bug

I tried to extract reads mapped to the mitochondrial genome using the command below:

$ pyega3 -cf credential.json fetch --reference-name MT mt_reads.bam

Sometimes the downloaded BAM file contains only a subset of the reads it should contain, even though the pyega3 run reports no error messages. The log file from a pyega3 run that yielded an incomplete set of reads said practically nothing.

Later, when we tried the read extraction locally using samtools (version 1.9), the samtools command sometimes crashed with a segmentation fault when using multiple cores. The crashes seemed to happen randomly. When using a single core (which is, I believe, the default setting), the crash did not occur. Also, the latest samtools (version 1.12) doesn't seem to have this crashing issue with multiple cores.

So I wonder whether the incomplete read extraction by 'pyega3 fetch' is somehow related to the samtools issue?

JocelynSP commented 3 years ago

Log files of apparently successful downloads which are actually truncated: slurm-1351075.log slurm-1350121.log
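Since these runs finish without any error, one cheap check that may help flag a truncated download is to look for the BGZF end-of-file marker that the SAM/BAM specification places at the end of every complete BAM file. The sketch below is not part of pyega3; the function name `has_bgzf_eof` is made up for illustration. It only detects a missing end-of-file block, so it would not catch a file in which reads are missing from the middle but the final block is intact.

```python
import os

# The 28-byte empty BGZF block defined by the SAM/BAM specification
# as the end-of-file marker of a complete BAM file.
BGZF_EOF = bytes.fromhex(
    "1f8b0804 00000000 00ff0600 42430200 1b000300 00000000 00000000"
)


def has_bgzf_eof(path):
    """Return True if the file at `path` ends with the BGZF EOF marker.

    Hypothetical helper for illustration; not a pyega3 function.
    """
    size = os.path.getsize(path)
    if size < len(BGZF_EOF):
        # Too short to contain the marker at all.
        return False
    with open(path, "rb") as handle:
        # Read exactly the last 28 bytes of the file.
        handle.seek(-len(BGZF_EOF), os.SEEK_END)
        return handle.read() == BGZF_EOF
```

For example, `has_bgzf_eof("mt_reads.bam")` returning False would suggest the file was cut off before the final block was written. `samtools quickcheck` performs a similar end-of-file check from the command line.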

malloryfreeberg commented 3 years ago

Hi @jungch and @JocelynSP - Thank you for reporting this issue. We will discuss internally the suggestion you raised, and we might also reach out separately to get some more details of what you've tried so we can try to reproduce what you've seen.

malloryfreeberg commented 3 years ago

Hi @jungch and @JocelynSP. Unfortunately we have been unable to reproduce this issue. I have followed up via email with a suggested next step.

malloryfreeberg commented 3 years ago

Hi @jungch and @JocelynSP. A quick update: We have been working on a reimplementation of the htsget protocol, which we hope will generally improve the fetching of genomic ranges. In the meantime, Daniel should be in touch to get you the files you need, from the list you shared via email previously. I will close this issue as this appears to be specific to the files you are requesting. Thanks!