bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

Using UMIs in the bcbio smallRNA pipeline #2347

Closed mxhp75 closed 5 years ago

mxhp75 commented 6 years ago

Hi,

This is somewhat similar to #2070. We have single-end .fastq files with the following format:

@NB500965:105:HC5J5BGX2:1:11108:16467:3587 1:N:0:ATCACG
TTCAAGTAATCCAGGATAGGAACTGTAGGCACCATCAATGACACCGAACGTAGATCGGAAAGCACACGTCTGAACT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAAEEE/EE

where ATCACG = the unique sample index and AACTGTAGGCACCATCAAT = the 3' adapter

Following the 3' adapter is a 12 nt UMI. If I massage the .fastq files so that the reads are in the format:

@NB500965:105:HC5J5BGX2:1:11108:16467:3587 1:N:0:ATCACG:UMI_GACACCGAACGTAGA
TTCAAGTAATCCAGGATAGGAACTGTAGGCACCATCAATGACACCGAACGTAGATCGGAAAGCACACGTCTGAACT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAAEEE/EE

am I then able to add umi_type: fastq_name to the bcbio .yaml config and run through the small RNA pipeline? Is there a better way of doing this?

All advice gratefully received.
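For illustration, a minimal sketch of that massaging step, assuming the 3' adapter AACTGTAGGCACCATCAAT always appears in the read and the 12 nt immediately after it form the UMI; file names are placeholders, and this is not part of bcbio:

import gzip

ADAPTER = "AACTGTAGGCACCATCAAT"  # 3' adapter from the read layout above
UMI_LEN = 12                     # UMI length as described above

def tag_umi(header, seq, qual):
    """Append the UMI found after the 3' adapter to the read name."""
    pos = seq.find(ADAPTER)
    if pos < 0:
        return None  # adapter not found; decide how to handle these reads
    start = pos + len(ADAPTER)
    umi = seq[start:start + UMI_LEN]
    return "%s:UMI_%s\n%s\n+\n%s\n" % (header, umi, seq, qual)

with gzip.open("sample.fastq.gz", "rt") as fh, open("sample_umi.fastq", "w") as out:
    for header in fh:
        seq = next(fh).rstrip()
        next(fh)  # '+' separator line
        qual = next(fh).rstrip()
        rec = tag_umi(header.rstrip(), seq, qual)
        if rec:
            out.write(rec)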

lpantano commented 6 years ago

Hi,

That is very interesting, and I am happy to help make the pipeline compatible with this.

I’ll need to talk to @chapmanb to see how the small RNA pipeline can be made compatible with the umi_type option.

I understand that the tag in the read name is for the different samples.

I am guessing that the point of the UMI after the adapter is to quantify sequences that share the same 12-mer UMI only once, right?

I am asking because the pipeline works like this:

1. Remove the 3’ adapter.
2. Collapse identical reads into one and track the number of times they appear.

So for step 2, instead of counting the number of reads, do you want to count the number of distinct 12-mer UMIs, so that this becomes the expression value and avoids amplification bias?
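For concreteness, a hedged sketch (toy data, not the pipeline's actual code) of the difference between counting reads and counting distinct UMIs per sequence:

from collections import defaultdict

# toy (sequence, UMI) pairs, as if parsed from read names after trimming
reads = [("TGAGGTAGTAGGTTGT", "AAATTTCCCGGG"),
         ("TGAGGTAGTAGGTTGT", "AAATTTCCCGGG"),  # PCR duplicate: same UMI
         ("TGAGGTAGTAGGTTGT", "CCCAAATTTGGG")]  # a second real molecule

read_counts = defaultdict(int)
umis_per_seq = defaultdict(set)
for seq, umi in reads:
    read_counts[seq] += 1
    umis_per_seq[seq].add(umi)

for seq in read_counts:
    # plain collapsing reports 3 reads; UMI-aware collapsing reports 2 molecules
    print(seq, "reads:", read_counts[seq], "distinct UMIs:", len(umis_per_seq[seq]))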

Thanks


mxhp75 commented 6 years ago

Hi Lorena

Thank you for the quick response.

Yes, the use of the 12-mer UMI is so that we can collapse (or otherwise assign proportions to) reads which have been quantified more than once and reflect a technical artefact rather than biological variance.

The plan for our data is to create a global miRNA expression profile in tissue A & B, differential expression between groups, and also to take advantage of the isomiR counts from the bcbio smallRNA pipeline.

I'm also interested in seeing what difference we get running through the bcbio smallRNA pipeline with and without the UMIs (if indeed it is possible) - in order to see what (if any) significant differences we find in the miRNA counts.

Regards

Melanie

lpantano commented 6 years ago

Hi Melanie,

I discussed this with Rory and Brad, and the quickest and best way would be to have one file per sample and the UMI in the read name. I can integrate this easily into the pipeline when collapsing reads, to quantify only the distinct UMIs.

Do you think that would be feasible for you?

So mainly, if you input one file per sample with the UMI in the read name as you specify, I think it can work. I’ll work on that and have something ready in the next few days.

@roryk, can you help us figure out how to get that, since I know it is similar to the RNA-seq pipeline? The structure of the reads is like this (is this the standard way to put the UMI in the read name?):

@nb500965:105:HC5J5BGX2:1:11108:16467:3587 1:N:0:ATCACG
TTCAAGTAATCCAGGATAGGAACTGTAGGCACCATCAATGACACCGAACGTAGATCGGAAAGCACACGTCTGAACT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAAEEE/EE

where ATCACG = the unique sample index and AACTGTAGGCACCATCAAT = the 3' adapter

Following the 3' adapter is a 12 nt UMI. If I massage the .fastq file such that they are in the format:

@NB500965:105:HC5J5BGX2:1:11108:16467:3587 1:N:0:ATCACG:UMI_GACACCGAACGTAGA
TTCAAGTAATCCAGGATAGGAACTGTAGGCACCATCAATGACACCGAACGTAGATCGGAAAGCACACGTCTGAACT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAAEEE/EE

Cheers


mxhp75 commented 6 years ago

Hi Lorena

I have already merged the 4 files (1 per lane) so I have a single fastq.gz per sample. I will work on a little tool to grab the UMI and put it into the read name.

Thanks

Melanie

lpantano commented 6 years ago

Thanks.

If the read names have UMI_<12-mer> tags, then I think it should be good to go once you update seqcluster with the bcbio_conda binary:

bcbio_conda update seqcluster

Make sure it installs build 3.

I added the code to handle these reads and collapse the data twice, so the expression values relate only to distinct UMIs.

Let me know if you find any issue because I couldn’t test this with real files.

Cheers


mshadbolt commented 6 years ago

Hi Lorena, I am interested in trying this out, as I just received a small non-coding RNA dataset with UMIs. Do I just need to add the line umi_type: fastq_name to my yaml for it to work?

Will it work with any length of UMI? Mine are actually 6 bp.

One of the tools I know that can process UMIs so that they are added to the read name is fastp, but it actually places the UMI as follows, so it would be good if this format were also compatible:

@nb500965:105:HC5J5BGX2:1:11108:16467:3587:UMI_GACACCGAACGTAGA 1:N:0:ATCACG
TTCAAGTAATCCAGGATAGGAACTGTAGGCACCATCAATGACACCGAACGTAGATCGGAAAGCACACGTCTGAACT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEAAEEE/EE
lpantano commented 6 years ago

Hi,

Right now, if the samples are already split, so that you have one file per sample, and the UMI is inside each fastq file, the pipeline will automatically quantify the miRNAs using the UMI tag instead of all reads.

It looks for UMI_([ATGC]*), so it should be ok with your fastq files, and you don't need to use the umi_type option right now.

You will see this message in the log file if the UMIs were detected:

Find UMI tags in read names, collapsing by UMI
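If you want to check quickly whether a given read name will be picked up, here is a sketch using that pattern (the truncated names are just for display); note that re.search scans the whole line, so both the comment-field placement and the fastp in-name placement match:

import re

UMI_RE = re.compile(r"UMI_([ATGC]*)")

names = ["@NB500965:105:...:3587 1:N:0:ATCACG:UMI_GACACCGAACGTAGA",  # UMI in the comment
         "@nb500965:105:...:3587:UMI_GACACCGAACGTAGA 1:N:0:ATCACG"]  # fastp style, in the name
for name in names:
    m = UMI_RE.search(name)
    print(m.group(1) if m else "no UMI found")  # both print GACACCGAACGTAGA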

Let me know if you find any issue.

Cheers


mshadbolt commented 6 years ago

Hi Lorena, I get the following error since updating to the latest seqcluster to try to collapse the UMIs:

[2018-06-26T16:23Z] Skip trimming for: AACCCC_GSC_trimmed
Traceback (most recent call last):
  File "~/bcbio-nextgen/tools/bin/bcbio_nextgen.py", line 241, in <module>
    main(**kwargs)
  File "~/bcbio-nextgen/tools/bin/bcbio_nextgen.py", line 46, in main
    run_main(**kwargs)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 45, in run_main
    fc_dir, run_info_yaml)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 89, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 320, in smallrnaseqpipeline
    samples = rnaseq_prep_samples(config, run_info_yaml, parallel, dirs, samples)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 410, in rnaseq_prep_samples
    samples = run_parallel("trim_srna_sample", samples)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"], batch_size=1)(joblib.delayed(fn)(x) for x in items):
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 625, in dispatch_one_batch
    self._dispatch(tasks)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 588, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 111, in apply_async
    result = ImmediateResult(func)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 332, in __init__
    self.results = batch()
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 52, in wrapper
    return apply(f, *args, **kwargs)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 107, in trim_srna_sample
    return srna.trim_srna_sample(*args)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/srna/sample.py", line 105, in trim_srna_sample
    data["collapse"] = _collapse(data["clean_fastq"])
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/srna/sample.py", line 190, in _collapse
    seqs = collapse(in_file)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/seqcluster/libs/fastq.py", line 20, in collapse
    return collapse_umi(in_file)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/seqcluster/libs/fastq.py", line 44, in collapse_umi
    keep[umis][1].update(qual)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/seqcluster/libs/classes.py", line 61, in update
    self.qual = map(add, now, [ord(value) for value in q])
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
lpantano commented 6 years ago

Hi,

Sorry about that. I guess this is a case I didn't expect in the code I wrote. I will need your help to find the issue.

Can you run this command with the sample that is failing:

~/bcbio-nextgen/data/anaconda/bin/ipython

from collections import Counter, defaultdict
from seqcluster.libs.classes import quality, umi
from seqcluster.libs.fastq import open_fastq
from itertools import product
import gzip
import re

in_file = "trimmed_fastq.gz"

keep = defaultdict(dict)
with open_fastq(in_file) as handle:
    for line in handle:
        if line.startswith("@"):
            m = re.search('UMI_([ATGC]*)', line.strip())
            umis = m.group(0)
            seq = handle.next().strip()
            handle.next()
            qual = handle.next().strip()
            print [line.strip(), seq, qual]
            if umis in keep:
                keep[umis][1].update(qual)
                keep[umis][0].update(seq)
            else:
                keep[umis] = [umi(seq), quality(qual)]

Then send me the last line that is printed; it should be the read that doesn't have the information I assumed.

Thanks!


mshadbolt commented 6 years ago

Hi Lorena, is this the info you need:

['@NS2:51:HGCLNBGX5:1:11101:14292:13150:UMI_CCGGCT 1:N:0:AACCCC', 'TCTCCGGGATCTGTCGCGTTA', 'EEAAA/EAAE/<E<<EEEEEE']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-28e358503dba> in <module>()
     10             print [line.strip(), seq, qual]
     11             if umis in keep:
---> 12                 keep[umis][1].update(qual)
     13                 keep[umis][0].update(seq)
     14             else:

~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/seqcluster/libs/classes.pyc in update(self, q, counts)
     59     def update(self, q, counts = 1):
     60         now = self.qual
---> 61         self.qual = map(add, now, [ord(value) for value in q])
     62         self.times += counts
     63 

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
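The traceback is consistent with a Python 2 quirk: map() called with two sequences of different lengths pads the shorter one with None, so add(int, None) raises exactly this TypeError whenever two reads sharing a UMI have different lengths. A minimal Python 2 reproduction:

from operator import add

now = [68, 68, 68, 68]  # running quality sums for a 4 nt read
q = "EEE"               # a same-UMI read that is only 3 nt long
# Python 2 map() pads the shorter list with None, so add(68, None) fails
print map(add, now, [ord(value) for value in q])
# TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'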
lpantano commented 6 years ago

Thanks, I’ll make a fix tonight and push it to bcbio. I’ll mention you in the commit so you know it’s ready.

lpantano commented 6 years ago

Hi,

you can try to install seqcluster with bioconda: bcbio_conda install -c bioconda seqcluster, and make sure to install version 1.2.4a5.

fingers crossed! Let me know.

mshadbolt commented 6 years ago

Cool thanks, I will try it out and let you know.

mshadbolt commented 6 years ago

Oook, this time we got past the collapsing stage but now it seems like mirtop is complaining :(

[2018-06-28T21:07Z] Index BAM file: TTAGGC_GSC_trimmed.bam
[2018-06-28T21:07Z] Resource requests: miraligner, picard; memory: 3.60, 3.60; cores: 32, 32
[2018-06-28T21:07Z] Configuring 1 jobs to run, using 16 cores each with 57.7g of memory reserved for each job
[2018-06-28T21:07Z] Timing: small RNA annotation
[2018-06-28T21:07Z] multiprocessing: srna_annotation
[2018-06-28T21:07Z] Running razers3 against hairpins with ~/work/trimmed/AACCCC_GSC_trimmed/AACCCC_GSC_trimmed.clean.trimming.fq
[2018-06-28T21:10Z] Do miRNA annotation for ~/work/trimmed/AACCCC_GSC_trimmed/AACCCC_GSC_trimmed.clean.trimming.fq
[2018-06-28T21:10Z] Format is not tabular,guessing fasta
[2018-06-28T21:10Z] species found
[2018-06-28T21:10Z] Go to mapping...
[2018-06-28T21:10Z] Mismatches: 1
[2018-06-28T21:10Z] Trimming: 3
[2018-06-28T21:10Z] Addition: 3
[2018-06-28T21:10Z] Species: hsa
[2018-06-28T21:10Z] Thu Jun 28 14:10:55 PDT 2018
[2018-06-28T21:10Z] Reading reads
[2018-06-28T21:11Z] Number of reads to be mapped: 318521
[2018-06-28T21:11Z] Searching in precursors
[2018-06-28T21:11Z] Thu Jun 28 14:11:12 PDT 2018
[2018-06-28T21:11Z] Num reads annotated: 30543
[2018-06-28T21:11Z] Do miRNA annotation for ~/work/mirbase/AACCCC_GSC_trimmed/AACCCC_GSC_trimmed.mirna
[2018-06-28T21:11Z] INFO Run annotation
[2018-06-28T21:11Z] INFO Hits: 27457
Traceback (most recent call last):
  File "~/bcbio-nextgen/tools/bin/bcbio_nextgen.py", line 241, in <module>
    main(**kwargs)
  File "~/bcbio-nextgen/tools/bin/bcbio_nextgen.py", line 46, in main
    run_main(**kwargs)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 45, in run_main
    fc_dir, run_info_yaml)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 89, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 333, in smallrnaseqpipeline
    samples = run_parallel("srna_annotation", samples)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"], batch_size=1)(joblib.delayed(fn)(x) for x in items):
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 779, in __call__
    while self.dispatch_one_batch(iterator):
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 625, in dispatch_one_batch
    self._dispatch(tasks)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 588, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 111, in apply_async
    result = ImmediateResult(func)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 332, in __init__
    self.results = batch()
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 52, in wrapper
    return apply(f, *args, **kwargs)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 156, in srna_annotation
    return srna.sample_annotation(*args)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/srna/sample.py", line 134, in sample_annotation
    data['config'])
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/srna/sample.py", line 270, in _mirtop
    do.run(cmd.format(**locals()), "Do miRNA annotation for %s" % input_fn)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 23, in run
    _do_run(cmd, checks, log_stdout, env=env)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/bcbio/provenance/do.py", line 103, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command 'set -o pipefail; unset JAVA_HOME  && export PATH=~/bcbio-nextgen/data/anaconda/bin:$PATH &&  mirtop gff  --sps hsa --hairpin ~/bcbio-nextgen/data/genomes/Hsapiens/hg38-noalt/srnaseq/hairpin.fa --gtf ~/bcbio-nextgen/data/genomes/Hsapiens/hg38-noalt/srnaseq/mirbase.gff3 --format seqbuster -o ~/work/bcbiotx/tmp92mE4p ~/work/mirbase/AACCCC_GSC_trimmed/AACCCC_GSC_trimmed.mirna
INFO Run annotation
INFO Hits: 27457
INFO Valid hits (+/-3 reference miRNA): 27762
['gff', '--sps', 'hsa', '--hairpin', '~/bcbio-nextgen/data/genomes/Hsapiens/hg38-noalt/srnaseq/hairpin.fa', '--gtf', '~/bcbio-nextgen/data/genomes/Hsapiens/hg38-noalt/srnaseq/mirbase.gff3', '--format', 'seqbuster', '-o', '~/work/bcbiotx/tmp92mE4p', '~/work/mirbase/AACCCC_GSC_trimmed/AACCCC_GSC_trimmed.mirna']
Traceback (most recent call last):
  File "~/bcbio-nextgen/data/anaconda/bin/mirtop", line 6, in <module>
    sys.exit(mirtop.command_line.main())
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/mirtop/command_line.py", line 27, in main
    reader(kwargs["args"])
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/mirtop/gff/__init__.py", line 50, in reader
    out_dts[fn] = body.create(ann, database, sample, args)
  File "~/bcbio-nextgen/data/anaconda/lib/python2.7/site-packages/mirtop/gff/body.py", line 90, in create
    logger.warning("Same isomir %s from different sequence: \n%s and \n%s" % (annotation, res, seen_ann[annotation]))
NameError: global name 'res' is not defined
' returned non-zero exit status 1
mshadbolt commented 6 years ago

I also just had a quick look at the .mirna file for this sample and every detected isomiR has a read count of 1, which I don't think is correct, so maybe the UMI collapsing isn't performing correctly?

lpantano commented 6 years ago

Thanks for the information. I think that fixing the bug with jet lag wasn’t a good idea. I’ll try to fix it for real tonight. I’ll let you know.


lpantano commented 6 years ago

Hi,

can you try to update seqcluster with this command:

bcbio_pip install --upgrade --no-deps git+https://github.com/lpantano/seqcluster.git#egg=seqcluster

and start from scratch?

Also, can you tell me what mirtop version you have installed in the bcbio conda environment? I am looking into the other error.

Thanks!

mshadbolt commented 6 years ago

Hi Lorena

I will try again after reinstalling and let you know how I go. The mirtop version I'm using is mirtop-0.3.11a0-py27_0

mshadbolt commented 6 years ago

Tried the update and re-ran from scratch, but I get the exact same error as above and the .mirna file still has all single-read frequencies.

lpantano commented 6 years ago

Hi,

Can you check the seqcluster version?

It should be 1.2.4a7. I updated bioconda, so maybe you can try to update using the bcbio_conda command.

If that doesn't work, do you think it is possible to share one file with me?

I don't have any and it is difficult to debug this way.

Let me know if you can share part of the file.

Sorry for the issues. Hopefully I can fix it soon.

Cheers


rbatorsky commented 5 years ago

Hello,

I have a similar situation where I have a UMI in my small RNAseq data (from QIAseq miRNA prep). Do I understand correctly from the thread that the smallRNA-seq pipeline can handle collapsing reads with the same UMI? The dataset I'm working with is canine, which has miRNA transcripts in mirbase22.

My single end fastq reads look like this:

@D00780:267:CCGFFANXX:1:1102:1381:1969 1:N:0:CAGATC
GTATCCGTCGTGGAACTGTAGGCACCATCAATTTAATCCAATCAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTT
+
<BBBBFFFFFFFFFFFFFFFFFFFF<FFFFFFFFBFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFF<FF

Working from the left: GTATCCGTCGTGG is the insert, AACTGTAGGCACCATCAAT is the first 3' adapter, followed by a 12 bp UMI (TTAATCCAATCA), and then the second 3' adapter.

I'm not sure how to use the adapter trimming implemented in the small RNA pipeline, or if the suggestion is to preprocess my reads to put the UMI in the read name and skip adapter trimming in the pipeline, so that they look like this:

@D00780:267:CCGFFANXX:1:1102:1381:1969:UMI_TTAATCCAATCA 1:N:0:CAGATC
GTATCCGTCGTGG
+
<BBBBFFFFFFFF

Thank you for any advice. Rebecca
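A sketch of that preprocessing under the layout described above (insert, first adapter, 12 bp UMI, second adapter); this is illustrative only, not an official QIAseq tool, and the adapter and UMI length are taken from the read shown:

ADAPTER1 = "AACTGTAGGCACCATCAAT"  # first 3' adapter from the read above
UMI_LEN = 12

def qiaseq_to_umi_name(header, seq, qual):
    """Trim at the first adapter, take the next 12 bp as the UMI,
    and splice the UMI into the read name (illustrative sketch)."""
    pos = seq.find(ADAPTER1)
    if pos < 0:
        return None  # adapter not found; skip or handle separately
    start = pos + len(ADAPTER1)
    umi = seq[start:start + UMI_LEN]
    name, _, comment = header.partition(" ")
    return "%s:UMI_%s %s\n%s\n+\n%s\n" % (name, umi, comment, seq[:pos], qual[:pos])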

lpantano commented 5 years ago

Hello Rebecca,

the small RNA pipeline will handle that if you pre-process your reads into the format you described at the end of your message.

It would be good to have this step implemented in bcbio for this pipeline, but that is not the case yet, sadly :(

You can check whether the files you generate work by using the seqcluster collapse command, which should be installed together with bcbio inside the conda folder.

I don't have any example to test with, so it would be great if you could test that. If I had the files I would do:

seqcluster collapse -f fastq_file -o outfolder (original fastq)

and

seqcluster collapse -f umi_fastq_file -o outfolder (UMI in the read name)

and compare the counts you get for the same read.
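One way to do that comparison, sketched under the assumption that seqcluster collapse writes collapsed read names of the form seq_N_xM (output paths are placeholders):

import re

def collapsed_counts(path):
    """Map each collapsed sequence to its count from seq_N_xM headers."""
    counts = {}
    with open(path) as fh:
        lines = [l.strip() for l in fh]
    for header, seq in zip(lines[0::4], lines[1::4]):
        m = re.search(r"_x(\d+)$", header)
        if m:
            counts[seq] = int(m.group(1))
    return counts

plain = collapsed_counts("outfolder/original_trimmed.fastq")  # placeholder paths
umi = collapsed_counts("outfolder_umi/umi_trimmed.fastq")
for seq in sorted(set(plain) & set(umi)):
    if plain[seq] != umi[seq]:
        print(seq, "reads:", plain[seq], "after UMI collapse:", umi[seq])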

If things make sense, then you can use bcbio with the processed fastq files and it should work.

let me know what happens!

Thanks


rbatorsky commented 5 years ago

Hello and thanks! I believe the seqcluster step did handle the UMI in the fasta header.

With the original fasta I get:

INFO Run collapse
INFO Sequences loaded: 0
INFO Writing sequences to input-trim_trimmed.fastq
INFO It took 0.000 minutes

With the fasta with UMI in the read name I get:

INFO Run collapse
INFO Find UMI tags in read names, collapsing by UMI.
INFO Sequences loaded: 1595450
INFO Writing sequences to input-trim-umi_trimmed.fastq
INFO It took 34.924 minutes

However, when I ran the small RNA pipeline, I didn't get any outputs in the mirdeep2 or mirbase directories. This is the config file

upload:
  dir: final
details:
  - files: [/cluster/tufts/rt/rbator01/work/test_bcbio_small_rna/trim-umi.fastq.gz]
    description: test
    analysis: smallRNA-seq
    algorithm:
      aligner: star # any other aligner is supported.
      expression_caller: [trna, seqcluster, mirdeep2]
      species: cfa
    genome_build: canFam3

I do see this message in the logs:

[2018-10-18T13:58Z] Create seqbuster count data at /cluster/kappa/90-days-archive/rt/rbator01/work/test_bcbio_small_rna/mirbase.
[2018-10-18T13:58Z] WARNING::any samples have miRNA annotated for seqbuster. Check if fasta files is small or species value.
[2018-10-18T13:58Z] Create seqbuster_novel count data at /cluster/kappa/90-days-archive/rt/rbator01/work/test_bcbio_small_rna/mirdeep2.
[2018-10-18T13:58Z] WARNING::any samples have miRNA annotated for seqbuster_novel. Check if fasta files is small or species value.

and lower down

[2018-10-18T13:31Z] Preparing for mirdeep2 analysis.
[2018-10-18T13:31Z] mirdeep2 Rfam file not instaled. Skipping...

So it seems that some resources are missing for the mirbase analysis. Is there maybe an installation step that I'm missing? I've done bcbio_nextgen.py upgrade --datatarget smallrna.

Thanks again!

lpantano commented 5 years ago

Hi,

Thanks for checking that, glad it seems to be working.

I didn't set up an installation for that species (canFam3), so probably that is the problem. I can do that, but I will need to wait until Monday.

Another favor to ask: just to make sure the input is ok, can you send me a few lines of the input file you are providing? I see you did the trimming and UMI identification before bcbio; did you use a specific tool for the UMI identification and modification of the fastq file?

Cheers


rbatorsky commented 5 years ago

Hi and thanks again!

I went too fast last time; I think seqcluster may not be collapsing the UMIs in read names correctly.

The reads that seqcluster generates always have _x1, meaning they are found only once. I tried adding the UMI_ label both in the comment field and in the read name itself, and get the same result, i.e.

@D00780:267:CCGFFANXX:1:1102:2846:1990 1:N:0:CAGATC:UMI_GCGACGCTTACA
TCCTGTACTGAGCTGCCCCGAGT

and

@D00780:267:CCGFFANXX:1:1102:2846:1990:UMI_GCGACGCTTACA 1:N:0:CAGATC
TCCTGTACTGAGCTGCCCCGAGT

That specific sequence is actually canine mir-486, and I find 275 reads in my input fastq that have the same UMI and the same miRNA sequence, but seqcluster is reporting them separately.
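A quick sanity check for this kind of case, as a sketch: count how many reads in the input share the same (sequence, UMI) pair (the file name is a placeholder):

import re
from collections import Counter

pairs = Counter()
umi = None
with open("input_umi.fastq") as fh:  # placeholder file name
    for i, line in enumerate(fh):
        if i % 4 == 0:  # header line: pull the UMI out of the read name
            m = re.search(r"UMI_([ATGC]+)", line)
            umi = m.group(1) if m else None
        elif i % 4 == 1:  # sequence line: count the (sequence, UMI) pair
            pairs[(line.strip(), umi)] += 1

for (seq, u), n in pairs.most_common(5):
    print(n, u, seq)  # e.g. ~275 for the mir-486 sequence described above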

To answer your questions: 1) I used cutadapt to trim the first 3' adapter, and this script to extract the UMI into the read name (I modified it a few times to move the UMI label around in the read name): https://github.com/HuntsmanCancerInstitute/UMIScripts/blob/master/bin/qiagen_smallRNA_umi_extractor.pl

2) It would be great if you could add the canine miRNA information!

3) Please see the attached file with some example fastq reads: example.fastq.txt

Thank you again so much!

lpantano commented 5 years ago

Thanks for all the information.

Can you check the version of the installed seqcluster: bcbio_conda list | grep seqcluster

Any chance you can share with me part of the input file that contains repeated sequences with the same and with different UMIs, to debug the issue?

If you could extract the mir-486 example you mention, it would be helpful and I could dig into this.

so sorry for the issue.

Cheers


lpantano commented 5 years ago

For instance, with this file and this command:

https://gist.github.com/lpantano/a9977125473a405e0b1aeb72d4e7d962

I get the correct example_trimmed.fastq, identifying 3 sequences that are the same.

The version I used is 1.2.4a6.

Lo


rbatorsky commented 5 years ago

Thanks Lorena!

I’m also using seqcluster version 1.2.4a6.

I’ve created some dummy data with human miRNAs and UMI labels to try to figure out what’s going on and avoid the problem with canine not being completely supported.

dummy_human_data.fastq.gz

The data has three different UMIs and two different human miRNA sequences.

I’m having two separate issues:

1) seqcluster gives different behavior for fastq and fastq.gz files
2) bcbio is crashing on fastq and fastq.gz files that have UMI tags in the read names

Problem 1) when I run: seqcluster collapse -f dummy_human_data.fa -o test_dummy_fa

I get the log file

['collapse', '-f', 'dummy_human_data.fa', '-o', 'test_dummy_fa']
INFO Run collapse
INFO Find UMI tags in read names, collapsing by UMI.
INFO Sequences loaded: 6
INFO Writing sequences to test_dummy_fa/dummy_human_data_umi_trimmed.fastq
INFO Sequences loaded: 3
INFO Writing sequences to test_dummy_fa/dummy_human_data_trimmed.fastq
INFO It took 0.000 minutes

and two output files:

ls test_dummy_fa/
dummy_human_data_trimmed.fastq  dummy_human_data_umi_trimmed.fastq  log

one of which uses the umi:

cat test_dummy_fa/dummy_human_data_trimmed.fastq 
@seq_1_x3
TGAGGTAGTAGGTTGTATAGTT
+
BBBBBFFDDFF>FFFFFFFFFF
@seq_2_x3
CTATACAATCTACTGTCTTTC
+
BBBBBFFFFFFFFFFFFFFFD

and one that doesn’t:

cat test_dummy_fa/dummy_human_data_umi_trimmed.fastq 
@seq_1_x1
TGAGGTAGTAGGTTGTATAGTT
+
BBBBBFFFFFFFFFFFFFFFFF
@seq_2_x1
CTATACAATCTACTGTCTTTC
+
BBBBBFFFFFFFFFFFFFFFF
@seq_3_x1
CTATACAATCTACTGTCTTTC
+
BBBBBFFFFFFFFFFFFFFFF
@seq_4_x1
TGAGGTAGTAGGTTGTATAGTT
+
BBBBBFFFFFF/FFFFFFFFFF
@seq_5_x1
TGAGGTAGTAGGTTGTATAGTT
+
BBBBBFFBBFFFFFFFFFFFFF
@seq_6_x1
CTATACAATCTACTGTCTTTC
+
BBBBBFFFFFFFFFFFFFFFB

Oddly enough when I do the same thing with the gzipped file, I only get the second file:

seqcluster collapse -f dummy_human_data.fa.gz -o test_dummy_fagz
['collapse', '-f', 'dummy_human_data.fa.gz', '-o', 'test_dummy_fagz']
INFO Run collapse
INFO Find UMI tags in read names, collapsing by UMI.
INFO Sequences loaded: 6
INFO Writing sequences to test_dummy_fagz/dummy_human_data_trimmed.fastq
INFO It took 0.000 minutes
ls test_dummy_fagz
dummy_human_data_trimmed.fastq  log

I'm not sure if that is expected behavior or not, but at least it explains why I didn't see the UMI being used (I was using a gz file).
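The fastq/fastq.gz split suggests the gzipped input may be taking a different code path somewhere; a defensive opener that treats both identically would look like this (an illustrative helper, not seqcluster's actual implementation):

import gzip

def open_fastq_any(path):
    """Open a plain or gzipped FASTQ transparently (illustrative only)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")  # text mode so lines come back as str
    return open(path)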

My real problem is that bcbio crashes with either of these files with UMI in the read name, but is able to get (almost) to the end of the analysis without UMI in the read names.

Here is my template:

upload:
  dir: final
details:
  - files: [/cluster/tufts/rt/rbator01/work/test_fagz/dummy_human_data.fa.gz]
    description: dummydata_fagz
    analysis: smallRNA-seq
    algorithm:
      aligner: star 
      expression_caller: [trna, seqcluster, mirdeep2]
      species: hsa
    genome_build: GRCh37

and the result:

bcbio_nextgen.py template-srnaseq-fagz.yaml
[2018-10-22T14:24Z] System YAML configuration: /cluster/kappa/90-days-archive/rt/rbator01/tools/bcbio/galaxy/bcbio_system.yaml
[2018-10-22T14:24Z] Resource requests: atropos, picard; memory: 4.00, 4.00; cores: 16, 16
[2018-10-22T14:24Z] Configuring 1 jobs to run, using 1 cores each with 4.00g of memory reserved for each job
[2018-10-22T14:24Z] Timing: organize samples
[2018-10-22T14:24Z] multiprocessing: organize_samples
[2018-10-22T14:24Z] Using input YAML configuration: /cluster/kappa/90-days-archive/rt/rbator01/work/test_fagz/template-srnaseq-fagz.yaml
[2018-10-22T14:24Z] Checking sample YAML configuration: /cluster/kappa/90-days-archive/rt/rbator01/work/test_fagz/template-srnaseq-fagz.yaml
/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/gffutils/interface.py:161: UserWarning: It appears that this database has not had the ANALYZE sqlite3 command run on it. Doing so can dramatically speed up queries, and is done by default for databases created with gffutils >0.8.7.1 (this database was created with version 0.8.6.1) Consider calling the analyze() method of this object.
  "method of this object." % self.version)
[2018-10-22T14:24Z] Testing minimum versions of installed programs
[2018-10-22T14:24Z] multiprocessing: prepare_sample
[2018-10-22T14:24Z] Preparing dummydata_fagz
[2018-10-22T14:24Z] Timing: adapter trimming
[2018-10-22T14:24Z] multiprocessing: trim_srna_sample
[2018-10-22T14:24Z] Skip trimming for: dummydata_fagz
[2018-10-22T14:24Z] Collapsing /cluster/kappa/90-days-archive/rt/rbator01/work/test_fagz/trimmed/dummydata_fagz/dummy_human_data.clean.fa.gz with --min_size 16 --min 1
[2018-10-22T14:24Z] Resource requests: picard, samtools, star; memory: 4.00, 4.00, 4.00; cores: 16, 16, 16
[2018-10-22T14:24Z] Configuring 1 jobs to run, using 1 cores each with 4.00g of memory reserved for each job
[2018-10-22T14:24Z] Timing: prepare
[2018-10-22T14:24Z] multiprocessing: seqcluster_prepare
[2018-10-22T14:24Z] Prepare seqs.fastq with -minl 17 -maxl 40 -minc 2 --min_shared 0.1
[2018-10-22T14:24Z] Timing: alignment
[2018-10-22T14:24Z] multiprocessing: srna_alignment
Traceback (most recent call last):
  File "/cluster/tufts/rt/rbator01/tools/bin/bcbio_nextgen.py", line 242, in <module>
    main(**kwargs)
  File "/cluster/tufts/rt/rbator01/tools/bin/bcbio_nextgen.py", line 47, in main
    run_main(**kwargs)
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 45, in run_main
    fc_dir, run_info_yaml)
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 89, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 327, in smallrnaseqpipeline
    samples = run_parallel("srna_alignment", [samples])
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"], batch_size=1)(joblib.delayed(fn)(x) for x in items):
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 983, in __call__
    if self.dispatch_one_batch(iterator):
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 545, in __init__
    self.results = batch()
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 261, in __call__
    for func, args, kwargs in self.items]
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 52, in wrapper
    return apply(f, *args, **kwargs)
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 168, in srna_alignment
    return seqcluster.run_align(*args)
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/srna/group.py", line 70, in run_align
    sample = process_alignment(data[0][0], [seq_out, None])
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/sample.py", line 180, in process_alignment
    "If it is a fastq file (not pre-aligned BAM or CRAM), "
ValueError: Could not process input file from sample configuration. 
/cluster/kappa/90-days-archive/rt/rbator01/work/test_fagz/seqcluster/prepare/seqs.fastq
Is the path to the file correct or is empty?

The file it is complaining about is indeed empty:

/cluster/kappa/90-days-archive/rt/rbator01/work/test_fagz/seqcluster/prepare/seqs.fastq

If I remove the UMI tags it gets much further in the analysis and complains about too few sequences, which I’m guessing is just because it’s dummy data.

INFO Total genome with sequences: 3.48184e-08 
INFO Parsing matrix file
INFO counts after: 56
INFO # sequences after: 3
ERROR It seems you have low coverage. Please check your fastq files have enough sequences.

Thanks for any advice!

lpantano commented 5 years ago

Thanks a lot! I think I know what is going on. I’ll push a fix tomorrow and for the dog data as well.

Cheers


lpantano commented 5 years ago

Hi,

I updated the data for canFam3. You will need to update the data again, using the same command as before:

bcbio_nextgen.py upgrade --datatarget smallrna

To double-check that it updated, go to the folder where the canFam3 genome is installed and confirm there is an srnaseq folder.

I think I fixed the issue with the GZ file. I am trying to update the bioconda recipe. You can try to install seqcluster from pip with bcbio_pip install seqcluster and check that you get the latest version, 1.2.4a12.

https://pypi.org/project/seqcluster/1.2.4a12/

I'll let you know when the bioconda install is ready, so you can update with bcbio_conda install seqcluster if PyPI gives you issues.

Cheers

rbatorsky commented 5 years ago

Hi,

I was able to update the data but I'm not sure how to update the seqcluster used by bcbio. Below is the message I get. I can wait for the conda update, no problem.

bcbio_pip install seqcluster==1.2.4a12
Collecting seqcluster==1.2.4a12
  Using cached https://files.pythonhosted.org/packages/a9/77/89a7b3b46aae3e92622189612825644a03399798906c78875da555d06d5e/seqcluster-1.2.4a12.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/cluster/kappa/90-days-archive/rt/rbator01/tmp/pip-install-iekiI2/seqcluster/setup.py", line 11, in <module>
        with open("requirements.txt", "r") as f:
    IOError: [Errno 2] No such file or directory: 'requirements.txt'

    ----------------------------------------

Just a note - I'm not sure that the gz is causing the second error I posted above, since I get the same error running on the unzipped dummy_human_data.fastq as I do with the zipped version. As long as the UMI is included in the read name, I seem to get an empty file in this directory created during the run:

seqcluster/prepare/seqs.fastq

I hope it's not a problem with my installation. Just curious, are you able to get bcbio to generate that file using my test file above?

Thanks again!

lpantano commented 5 years ago

Hi,

I think you can update seqcluster with bcbio_conda install seqcluster -c bioconda -f.

I tested your dummy file (thanks!) and it is working. It will still fail because there are so few sequences, but it is annotating miRNAs.

Also, are your files already trimmed?

We are working to support this protocol in bcbio; thanks for pointing us to the script.

Hopefully, we'll have it soon!

Thanks!

mxhp75 commented 5 years ago

Hi Lorena

I am working with human small RNA sequence data and have tested the seqcluster collapse using dummy data.

> seqcluster collapse -f Fastq_file -o outfolder (original file)
> seqcluster collapse -f umi_fastq_file -o outfolder (UMI in read name)

I am returned results as expected, i.e. for the "normal" file each sequence is appended with @seq_1_x1, @seq_2_x1 etc., and for the UMI file I get @seq_1_x4, @seq_2_x1, where I purposefully added 4 copies of read 1 with the same UMI.
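For intuition, here is a minimal Python sketch of what UMI-aware collapsing does (my own illustration, not seqcluster's implementation; the sequences and UMIs are made up): identical inserts are merged into one record, and the multiplicity can be taken either as the raw read count or as the number of distinct UMIs.

    from collections import defaultdict

    def collapse(reads):
        """Collapse (sequence, umi) pairs into {sequence: (n_reads, n_umis)}."""
        raw = defaultdict(int)
        umis = defaultdict(set)
        for seq, umi in reads:
            raw[seq] += 1          # raw duplicate count
            umis[seq].add(umi)     # distinct UMIs seen for this insert
        return {seq: (raw[seq], len(umis[seq])) for seq in raw}

    # Four copies of one insert sharing a UMI, plus a second insert:
    reads = [("TCCTGTACTGAGCTGCCCCGA", "GCGACGCTTACA")] * 4 + \
            [("TACCCTGTAGAACCGAATTTG", "AAGGTTCCAAGG")]
    for seq, (n_reads, n_umis) in sorted(collapse(reads).items()):
        print(seq, "reads:", n_reads, "distinct UMIs:", n_umis)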

Am I able to try the full bcbio small RNA pipeline? Do I need to add anything to the template .yaml (i.e. "tell" bcbio I'm using the UMI), or is just using the files that include the UMI in the header enough?

I am very excited about seeing how the results with the UMI compare to running without it. I have 96 samples in total that I can run and then cross-check the results.

Melanie

lpantano commented 5 years ago

Hi Melanie,

I would suggest checking that you have installed the latest version of seqcluster:

bcbio_conda list | grep seqcluster (the version should be 1.2.4a12).

If so, I would suggest you try the pipeline. You don't need to specify anything else; you just need to make sure the read name has the UMI in this format: UMI_GCGACGCTTACA. If your files are already trimmed, you don't need to specify the 3' adapter; if you need to trim them, specify the 3' adapter with adapters: [$adapter_sequence].
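If the UMI is still embedded in the sequence, here is a minimal sketch of one way to move it into the read name (an illustration only, not bcbio's umis transform; the adapter, UMI length, and file names are assumptions for the example):

    import gzip

    ADAPTER = "AACTGTAGGCACCATCAAT"  # assumed 3' adapter preceding the UMI
    UMI_LEN = 12                     # assumed UMI length

    def tag_umis(fastq_in, fastq_out):
        """Append :UMI_<umi> to each read name, taking the UMI as the
        UMI_LEN bases right after the first adapter occurrence."""
        with gzip.open(fastq_in, "rt") as fin, open(fastq_out, "w") as fout:
            for name in fin:
                seq = next(fin).rstrip()
                plus = next(fin).rstrip()
                qual = next(fin).rstrip()
                name = name.rstrip().split()[0]
                pos = seq.find(ADAPTER)
                if pos >= 0:
                    start = pos + len(ADAPTER)
                    name += ":UMI_" + seq[start:start + UMI_LEN]
                fout.write(f"{name}\n{seq}\n{plus}\n{qual}\n")

    tag_umis("sample.fastq.gz", "sample.umitagged.fastq")  # hypothetical file names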

Let me know what happens.

lpantano commented 5 years ago

Hi all,

I am sorry to say that this is still not working, but the good news is that I found the bug.

I'll push a fix tomorrow that will also add support for the Qiagen UMI protocol, so you won't need to pre-process your input files.

Thanks for the help with that; I will post here when it is done.

Cheers

lpantano commented 5 years ago

Hi @rbatorsky-claritas and @mxhp75,

I think I fixed it (1d54360d8acb1b3c69d6a0e95827e1b08aff355f); can you update to the latest development code and try again?

I hope this time I got it right.

Thanks

mxhp75 commented 5 years ago

Good morning,

Am I able to update to the development version in bioconda? I will try the new code when I get into the office.

Melanie

lpantano commented 5 years ago

Hi,

You need to use this command to update to development:

bcbio_nextgen.py upgrade -u development

Let me know!

Lo


mxhp75 commented 5 years ago

Hi

The above failed with:

    Solving environment: failed

    PackagesNotFoundError: The following packages are not available from current channels:

      - libgcc-ng

    Current channels:

    To search for alternate channels that may provide the conda package
    you're looking for, navigate to

        https://anaconda.org

    and use the search bar at the top of the page.

    Traceback (most recent call last):
      File "/Users/a1627211/miniconda2/bin/bcbio_nextgen.py", line 221, in <module>
        install.upgrade_bcbio(kwargs["args"])
      File "/Users/a1627211/miniconda2/lib/python2.7/site-packages/bcbio/install.py", line 61, in upgrade_bcbio
        _check_for_conda_problems()
      File "/Users/a1627211/miniconda2/lib/python2.7/site-packages/bcbio/install.py", line 242, in _check_for_conda_problems
        "--yes", "-c", "bioconda", "-c", "conda-forge", "libgcc-ng"])
      File "/Users/a1627211/miniconda2/lib/python2.7/subprocess.py", line 190, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['/Users/a1627211/miniconda2/bin/conda', 'install', '-f', '--yes', '-c', 'bioconda', '-c', 'conda-forge', 'libgcc-ng']' returned non-zero exit status 1

Any thoughts?

lpantano commented 5 years ago

Hi,

Am I right that you are installing bcbio on macOS?

If so, I am sorry to say that bcbio is not fully compatible with macOS; it can only run on a Linux OS.

If you don't have a Linux machine, you can use Docker to get an Ubuntu environment and install bcbio there.

If it is a Linux machine, I can dig further into what is going on.

Thanks!


mxhp75 commented 5 years ago

Hi

Yes - I'm testing on my Mac. I'll need to wait until Monday, when my supervisor is back, for permissions on the HPC. Meanwhile, I'll set up a VM and see if I can get things rolling there.

Thanks

Melanie

rbatorsky commented 5 years ago

Hi Lorena,

If I want to both trim the first 3' adapter AGATCGGAAGAGCACACGTCT and also parse the UMI that appears after the second 3' adapter AACTGTAGGCACCATCAAT, how should my template YAML look? I see the first 3' adapter is hard-coded in bcbio/data/umis/qiagen_smallRNA_umi-transform.json, so I don't understand how it is set up.

This is what I tried:

upload:
  dir: final
details:
  - files: [/cluster/tufts/rt/rbator01/work/test_label_umi/sample-cfa-486.fastq.gz]
    description: sample-cfa-486
    analysis: smallRNA-seq
    algorithm:
      aligner: star # any other aligner is supported.
      expression_caller: [trna, seqcluster, mirdeep2]
      species: cfa
      adapters: ['AGATCGGAAGAGCACACGTCT']
      umi_type: 'qiagen_smallRNA_umi'
    genome_build: canFam3

The result is that for a read like this:

    @D00780:267:CCGFFANXX:1:1102:2846:1990 1:N:0:CAGATC
    TCCTGTACTGAGCTGCCCCGAGTAACTGTAGGCACCATCAATGCGACGCTTACAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGT
    +
    BBBBBFFFFFFFFFFFFFFFFFFBFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFB

I get this read out in umis/sample-cfa-486.umitransformed.fq.gz:

    @D00780:267:CCGFFANXX:1:1102:2846:1990:UMI_GCGACGCTTACA
    TCCTGTACTGAGCTGCCCCGAGTAACTGTAGGCACCATCAATGCGACGCTTACAAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGT
    +
    BBBBBFFFFFFFFFFFFFFFFFFBFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFB

But I expected something like this, with all adapters and the UMI removed from the sequence:

    @D00780:267:CCGFFANXX:1:1102:2846:1990:UMI_GCGACGCTTACA
    TCCTGTACTGAGCTGCCCCGAGT
    +
    BBBBBFFFFFFFFFFFFFFFFFFBFF

When I ran with the above template, bcbio complained about the empty seqcluster/prepare/seqs.fastq.

Here is my full log file:

[2018-10-26T14:31Z] System YAML configuration: /cluster/kappa/90-days-archive/rt/rbator01/tools/bcbio/galaxy/bcbio_system.yaml
[2018-10-26T14:31Z] Resource requests: atropos, picard; memory: 4.00, 4.00; cores: 16, 16
[2018-10-26T14:31Z] Configuring 1 jobs to run, using 1 cores each with 4.00g of memory reserved for each job
[2018-10-26T14:31Z] Timing: organize samples
[2018-10-26T14:31Z] multiprocessing: organize_samples
[2018-10-26T14:31Z] Using input YAML configuration: /cluster/tufts/rt/rbator01/work/test_bcbio/test-umi-protocol-trim/template-srnaseq-fa.yaml
[2018-10-26T14:31Z] Checking sample YAML configuration: /cluster/tufts/rt/rbator01/work/test_bcbio/test-umi-protocol-trim/template-srnaseq-fa.yaml
[2018-10-26T14:31Z] Testing minimum versions of installed programs
[2018-10-26T14:31Z] multiprocessing: prepare_sample
[2018-10-26T14:31Z] Preparing sample-cfa-486
[2018-10-26T14:31Z] Timing: adapter trimming
[2018-10-26T14:31Z] multiprocessing: trim_srna_sample
[2018-10-26T14:31Z] Inserting UMI and barcode information into the read name of /cluster/tufts/rt/rbator01/work/test_label_umi/sample-cfa-486.fastq.gz
[2018-10-26T14:31Z] INFO:umis.umis:Detected UMI.
[2018-10-26T14:31Z] Adapter is set up in config file, but trim_reads is not true.If you want to skip trimming, skip adapter option from config.
[2018-10-26T14:31Z] remove adapter for sample-cfa-486
[2018-10-26T14:31Z] Running seqcluster collapse in /cluster/kappa/90-days-archive/rt/rbator01/work/test_bcbio/test-umi-protocol-trim/trimmed/sample-cfa-486/sample-cfa-486.umitransformed.clean.fq.gz.
[2018-10-26T14:31Z] INFO Run collapse
[2018-10-26T14:31Z] INFO Find UMI tags in read names, collapsing by UMI.
[2018-10-26T14:31Z] INFO Sequences loaded: 1302
[2018-10-26T14:31Z] INFO Writing sequences to /cluster/kappa/90-days-archive/rt/rbator01/work/test_bcbio/test-umi-protocol-trim/trimmed/sample-cfa-486/sample-cfa-486.umitransformed.clean_umi_trimmed.fastq
[2018-10-26T14:31Z] INFO Sequences loaded: 1302
[2018-10-26T14:31Z] INFO Writing sequences to /cluster/kappa/90-days-archive/rt/rbator01/work/test_bcbio/test-umi-protocol-trim/trimmed/sample-cfa-486/sample-cfa-486.umitransformed.clean_trimmed.fastq
[2018-10-26T14:31Z] INFO It took 0.006 minutes
[2018-10-26T14:31Z] ['collapse', '-o', '/cluster/kappa/90-days-archive/rt/rbator01/work/test_bcbio/test-umi-protocol-trim/trimmed/sample-cfa-486', '-f', '/cluster/kappa/90-days-archive/rt/rbator01/work/test_bcbio/test-umi-protocol-trim/trimmed/sample-cfa-486/sample-cfa-486.umitransformed.clean.fq.gz', '-m', '1', '--min_size', '16']
[2018-10-26T14:31Z] Resource requests: picard, samtools, star; memory: 4.00, 4.00, 4.00; cores: 16, 16, 16
[2018-10-26T14:31Z] Configuring 1 jobs to run, using 1 cores each with 4.00g of memory reserved for each job
[2018-10-26T14:31Z] Timing: prepare
[2018-10-26T14:31Z] multiprocessing: seqcluster_prepare
[2018-10-26T14:31Z] Prepare seqs.fastq with -minl 17 -maxl 40 -minc 2 --min_shared 0.1
[2018-10-26T14:31Z] Timing: seqcluster alignment
[2018-10-26T14:31Z] multiprocessing: srna_alignment
Traceback (most recent call last):
  File "/cluster/tufts/rt/rbator01/tools/bin/bcbio_nextgen.py", line 238, in <module>
    main(**kwargs)
  File "/cluster/tufts/rt/rbator01/tools/bin/bcbio_nextgen.py", line 46, in main
    run_main(**kwargs)
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 45, in run_main
    fc_dir, run_info_yaml)
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 89, in _run_toplevel
    for xs in pipeline(config, run_info_yaml, parallel, dirs, samples):
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/main.py", line 328, in smallrnaseqpipeline
    samples = run_parallel("srna_alignment", [samples])
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 28, in run_parallel
    return run_multicore(fn, items, config, parallel=parallel)
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multi.py", line 86, in run_multicore
    for data in joblib.Parallel(parallel["num_jobs"], batch_size=1)(joblib.delayed(fn)(x) for x in items):
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 983, in __call__
    if self.dispatch_one_batch(iterator):
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/_parallel_backends.py", line 545, in __init__
    self.results = batch()
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/joblib/parallel.py", line 261, in __call__
    for func, args, kwargs in self.items]
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/utils.py", line 52, in wrapper
    return apply(f, *args, **kwargs)
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/distributed/multitasks.py", line 172, in srna_alignment
    return seqcluster.run_align(*args)
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/srna/group.py", line 69, in run_align
    sample = process_alignment(data[0][0], [seq_out, None])
  File "/cluster/tufts/rt/rbator01/tools/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/sample.py", line 181, in process_alignment
    "If it is a fastq file (not pre-aligned BAM or CRAM), "
ValueError: Could not process input file from sample configuration. 
/cluster/kappa/90-days-archive/rt/rbator01/work/test_bcbio/test-umi-protocol-trim/seqcluster/prepare/seqs.fastq
Is the path to the file correct or is empty?
If it is a fastq file (not pre-aligned BAM or CRAM), is an aligner specified in the input configuration?

Here is the fastq I'm trying. I extracted all cfa-486 sequences from a real run as a test. sample-cfa-486.fastq.gz

Thanks again for your work on this!

lpantano commented 5 years ago

Hi,

The template is almost correct, but the adapter to use in your YAML should be AACTGTAGGCACCATCAAT.

The transform file only contains the information to put the UMI in the read name; you then need to trim the first adapter found in the read, which will leave the real small RNA in your sample.
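In other words, after the UMI transform the read still contains insert + adapter + UMI + sequencing adapter, and trimming at the first occurrence of AACTGTAGGCACCATCAAT is what recovers the small RNA. A rough sketch of the intended effect (an illustration only, not bcbio code; real trimmers such as atropos also tolerate mismatches and partial matches):

    ADAPTER = "AACTGTAGGCACCATCAAT"

    def trim_insert(seq, qual):
        """Keep only the bases (and qualities) before the first 3' adapter."""
        pos = seq.find(ADAPTER)
        return (seq[:pos], qual[:pos]) if pos >= 0 else (seq, qual)

    insert = "TCCTGTACTGAGCTGCCCCGAGT"
    read = insert + ADAPTER + "GCGACGCTTACA" + "AGATCGGAAGAGCACACGTCT"
    print(trim_insert(read, "B" * len(read))[0])  # -> TCCTGTACTGAGCTGCCCCGAGT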

Can you try that and let me know?

Thanks


mxhp75 commented 5 years ago

Hi Lorena

I have set up an Ubuntu VM, updated bcbio to the development version (I think), and set up a new conda environment to run from. On trying to run with one .fastq.gz file, bcbio cannot find the bcbio_system.yaml. Perhaps I haven't set up the environment correctly, but I'm not sure how to check that. Any thoughts?

Melanie

lpantano commented 5 years ago

Hi,

Did you install it using the command on the readme page:

https://github.com/bcbio/bcbio-nextgen/blob/master/docs/contents/installation.rst#installation

Or did you use different installation commands?

Cheers


mxhp75 commented 5 years ago

Good morning

I did use the command on the readme page, but I see now that I have a permissions issue. I'll remedy that and try again.

Melanie

rbatorsky commented 5 years ago

Hi Lorena,

I've rerun with the correct adapter in the template and the pipeline completes successfully (yay!).

I'm comparing the results to those from another pipeline that quantifies small RNA using the QIAseq UMI. I find that my previous pipeline gives counts roughly half as large as bcbio's for a given sample and miRNA.

For example, for the top 3 miRNAs in one sample (using counts_mirna.tsv in the project directory):

                   other pipeline    bcbio
    cfa-miR-486    69578             116237
    cfa-miR-16     8439              15732
    cfa-miR-92a    7797              13436

I’d like to dig into this, so I’m doing some manual counting of cfa-miR-486.

The sequence is TCCTGTACTGAGCTGCCCCGA, and just grepping for a large substring of it in the raw data I see 2,965,680 sequences.

In the file umitransformed.clean_umi_trimmed.fastq there are 69,310 sequences, and if I use the _x suffix to count the number of reads I get 2,798,800 (i.e. counting _x5 as 5). Am I correct that _x5 in this file means the sequence was seen with 5 different UMIs?
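For reference, the _x totalling can be done with a few lines of Python (assuming seqcluster's @seq_&lt;n&gt;_x&lt;count&gt; read-name convention; the file name is just the example from this thread):

    import re

    def total_counts(collapsed_fastq):
        """Sum the _x<count> multipliers over all collapsed read names."""
        total = 0
        pat = re.compile(r"@seq_\d+_x(\d+)")
        with open(collapsed_fastq) as fh:
            for i, line in enumerate(fh):
                if i % 4 == 0:  # FASTQ name line
                    m = pat.match(line)
                    if m:
                        total += int(m.group(1))
        return total

    print(total_counts("sample-cfa-486.umitransformed.clean_umi_trimmed.fastq"))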

Can you point me to any other intermediate file that would let me verify manually that the counting in bcbio is right? Next I was going to try to generate some simulated data with UMIs, which will take some time.

Lastly, how do I find the miRBase version used in the pipeline? I see canFam3,srnaseq,20181024 in data_versions.tsv, and in the mirbase/sample.gff:

## mirGFF3. VERSION 0.9
## source-ontology: miRBasev21 doi:10.25504/fairsharing.hmgte8
## COLDATA: test

Is it correct that miRBase v21 is being used?

Thank you again for all your help! Best, Rebecca

rbatorsky commented 5 years ago

Hi again,

I generated some simulated reads with a set number of miRNAs, a set number of UMIs, and a set number of duplicates per miRNA+UMI combination. In this simple case the counting seems to be done correctly in bcbio from beginning to end; the counts are all as expected. Guess I'll be looking at what my other pipeline is doing wrong 🙂
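For anyone wanting to reproduce this kind of check, a minimal sketch of simulating UMI-tagged reads (an illustration only; the miRNA sequences, UMI count, and duplicate count are made up, and the layout follows the insert + adapter + UMI protocol discussed above):

    import itertools
    import random

    ADAPTER = "AACTGTAGGCACCATCAAT"
    MIRNAS = ["TCCTGTACTGAGCTGCCCCGA", "TAGCAGCACGTAAATATTGGCG"]  # made-up panel
    N_UMIS = 5   # distinct UMIs per miRNA
    N_DUPS = 3   # PCR duplicates per miRNA+UMI combination
    random.seed(42)

    def random_umi(k=12):
        return "".join(random.choice("ACGT") for _ in range(k))

    with open("simulated.fastq", "w") as out:
        rid = itertools.count(1)
        for mir in MIRNAS:
            for _ in range(N_UMIS):
                umi = random_umi()
                for _ in range(N_DUPS):
                    seq = mir + ADAPTER + umi + "AGATCGGAAGAGC"
                    out.write(f"@sim_{next(rid)}\n{seq}\n+\n{'I' * len(seq)}\n")
    # After UMI-aware collapsing, each miRNA should be counted N_UMIS times,
    # not N_UMIS * N_DUPS.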

Thank you, Rebecca

lpantano commented 5 years ago

Hi Rebecca,

Thanks for all this work, it is super great!

In general, bcbio will detect more miRNAs because it detects more isomiRs (miRNA variants). Many pipelines are very restrictive about the mapping, with the good and bad consequences that come with that.

The clean_umi_trimmed.fastq file keeps the counts of reads (summed over the different UMIs), and the clean_trimmed.fastq file keeps the counts of distinct UMIs. This is to stay compatible with the file naming of the non-UMI pipeline.

Yes, it is using miRBase v21, because I haven't tested v22 yet. But I am happy to update it if you want the new version for this species.

I am happy to talk further about this if you have more questions or something doesn't make sense. I am glad the simulated data gives the expected results! Send me an email if you want to talk further.

Cheers


rbatorsky commented 5 years ago

Thanks again for your help! We are using v22 in our analysis, but I checked, and at least in the canine mature.fa there are no differences from v21. If you get around to it, having v22 would be great.