Handling (or not handling) PacBio SMRT data

mpoelchau commented 3 years ago

@ShangYuChiang recently ran into an issue where the pipeline couldn't handle data from PacBio RS II (specifically, this experiment: https://www.ncbi.nlm.nih.gov/sra/SRR6023533). I believe it fails here (https://github.com/NAL-i5K/NAL_RNA_seq_annotation_pipeline/blob/master/rnannot/RNAseq_annotate.py#L109) since the following step (normalization) can't find the file output.fastq; and the trimmomatic error log reports Error: Unable to detect quality encoding

After discussing this, we'll want to do 2 things:

For now, prevent the pipeline from downloading this data type
Explore whether this data type is worth aligning for our use cases. @suryasaha suggests a separate workflow/track. We will need to pick this up again once we have capacity for it.

In order to do 1., we need to update download_sra_metadata.py to exclude records that have PACBIO_SMRT in the Platform field (probably using efilter). @ShangYuChiang do you think you can try this?

Error message for reference:

java.lang.RuntimeException: Can't find file /lustre/project/nal_genomics/shelly.chiang/rnaseq_Bacdor/2021-04-15-04-02-33/SRR6023533/output.fastq
        at fileIO.ReadWrite.getRawInputStream(ReadWrite.java:906)
        at fileIO.ReadWrite.getInputStream(ReadWrite.java:871)
        at fileIO.FileFormat.getFirstOctet(FileFormat.java:429)
        at stream.FASTQ.isInterleaved(FASTQ.java:128)
        at stream.FastqReadInputStream.<init>(FastqReadInputStream.java:58)
        at stream.ConcurrentReadInputStream.getReadInputStream(ConcurrentReadInputStream.java:121)
        at stream.ConcurrentReadInputStream.getReadInputStream(ConcurrentReadInputStream.java:45)
        at bloom.KmerCount7MTA.count(KmerCount7MTA.java:375)
        at bloom.KmerCount7MTA.makeKca(KmerCount7MTA.java:276)
        at jgi.KmerNormalize.runPass(KmerNormalize.java:989)
        at jgi.KmerNormalize.main(KmerNormalize.java:675)
Made hash table:        hashes = 3       mem = 105.51 GB        cells = 56.65B          used = 0.000%

Estimated unique kmers:         0

Table creation time:            100.539 seconds.
Exception in thread "main" java.lang.RuntimeException: Can't find file /lustre/project/nal_genomics/shelly.chiang/rnaseq_Bacdor/2021-04-15-04-02-33/SRR6
023533/output.fastq
        at fileIO.ReadWrite.getRawInputStream(ReadWrite.java:906)
        at fileIO.ReadWrite.getInputStream(ReadWrite.java:871)
        at fileIO.FileFormat.getFirstOctet(FileFormat.java:429)
        at stream.FASTQ.isInterleaved(FASTQ.java:128)
        at stream.FastqReadInputStream.<init>(FastqReadInputStream.java:58)
        at stream.ConcurrentReadInputStream.getReadInputStream(ConcurrentReadInputStream.java:121)
        at stream.ConcurrentReadInputStream.getReadInputStream(ConcurrentReadInputStream.java:45)
        at jgi.KmerNormalize.count(KmerNormalize.java:1143)
        at jgi.KmerNormalize.runPass(KmerNormalize.java:1099)
        at jgi.KmerNormalize.main(KmerNormalize.java:675)

and

Normalizing ...
Traceback (most recent call last):
  File "/lustre/project/nal_genomics/hsiukang/rnannot_venv/bin/RNAseq_annotate.py", line 4, in <module>
    __import__('pkg_resources').run_script('rnannot==0.0.1', 'RNAseq_annotate.py')
  File "/lustre/project/nal_genomics/hsiukang/rnannot_venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 665, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/lustre/project/nal_genomics/hsiukang/rnannot_venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 1463, in run_script
    exec(code, namespace, namespace)
  File "/lustre/project/nal_genomics/hsiukang/rnannot_venv/lib/python3.7/site-packages/rnannot-0.0.1-py3.7.egg/EGG-INFO/scripts/RNAseq_annotate.py", line 394, in <module>
    with open(fname, 'r') as infile:
FileNotFoundError: [Errno 2] No such file or directory: '/lustre/project/nal_genomics/shelly.chiang/rnaseq_Bacdor/2021-04-15-04-02-33/SRR6023533/normalized.fastq'

ShangYuChiang commented 3 years ago

Thank @monica for providing the detail about the error. I modified the subprocess of the "efilter" to exclude records that have PACBIO_SMRT in the Platform field. I also tested the download_sra_metadata.py with the command: "download_sra_metadata.py -t 27457 -o 27457.tsv " The modified download_sra_metadata.py is in the PACBIO branch #47

Thanks :D

mpoelchau commented 3 years ago

This is merged. closing.

NAL-i5K / NAL_RNA_seq_annotation_pipeline

Handling (or not handling) PacBio SMRT data #46