chasem77 / ea-utils

Automatically exported from code.google.com/p/ea-utils
0 stars 0 forks source link

fastq-mcf does not process all fasta entries #12

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. make a fasta file with 14 entries called test.fa
2. run fastq-mcf as: fastq-mcf test.fa <fastq_set_of_choice>
3. Watch the log: only the first 12 entries are reported.

What is the expected output? What do you see instead?
Only the first 12 entries in the fasta file are processed, instead of all 
entries.

The input (fasta entries named from A to N)
==========
$ cat contlist.fa 
>contaA
CAACCATTCATTCCAGCCTTCAATTAAAAGACTAATGATTATGCTACCTT
>contaB
GATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGC
>contaC
AAAAAATTAAGTTACTTTAGGGATAACAGCGTAATTTTTTTGGAGAGTTC
>contaD
GTCCTTTCGTACTAAAATATCACAATTTTTTAAAGATAGAAACCAACCTG
>contaE
CTCGTCTTTTAAATAAATTTTAGCTTTTTGACTAAAAAATAAAATTCTAT
>contaF
GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGC
>contaG
CAAAAACATGTCTTTTTGAATTATATATAAAGTCTAACCTGCCCACTGAA
>contaH
CTAAAATATCACAATTTTTTAAAGATAGAAACCAACCTGGCTTACACCGG
>contaI
ATTTTTTTGGAGAGTTCATATCGATAAAAAAGATTGCGACCTCGATGTTG
>contaJ
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCGTCCCGATCTCGTA
>contaK
GTCCTTTCGTACTAAAATATCATAATTTTTTAAAGATAGAAACCAACCTG
>contaL
CTGGCTTACACCGGTTTGAACTCAGATCATGTAAGAATTTAAAAGTCGAA
>contaM
GATCGGAAGAGCACACGTCTGAACTCCAGTCACGCCAATATCTCGTATGC
>contaN
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGC

The output (entries A to L processed, not M and N)
==========
$ cat fastqmcf.log 
Scale used: 2.2
Phred: 33
Threshold used: 751 out of 300000
Adapter contaA (CAACCATTCATTCCAGCCTTCAATTAAAAGACTAATGATTATGCTACCTT): counted 
1101 at the 'start' of 'readsample.fastq', clip set to 7
Adapter contaC (AAAAAATTAAGTTACTTTAGGGATAACAGCGTAATTTTTTTGGAGAGTTC): counted 
1171 at the 'start' of 'readsample.fastq', clip set to 7
Adapter contaD (GTCCTTTCGTACTAAAATATCACAATTTTTTAAAGATAGAAACCAACCTG): counted 
1398 at the 'start' of 'readsample.fastq', clip set to 6
Adapter contaE (CTCGTCTTTTAAATAAATTTTAGCTTTTTGACTAAAAAATAAAATTCTAT): counted 
1051 at the 'start' of 'readsample.fastq', clip set to 7
Adapter contaG (CAAAAACATGTCTTTTTGAATTATATATAAAGTCTAACCTGCCCACTGAA): counted 
1044 at the 'end' of 'readsample.fastq', clip set to 7
Adapter contaH (CTAAAATATCACAATTTTTTAAAGATAGAAACCAACCTGGCTTACACCGG): counted 
640 at the 'start' of 'readsample.fastq', clip set to 7
Adapter contaI (ATTTTTTTGGAGAGTTCATATCGATAAAAAAGATTGCGACCTCGATGTTG): counted 
1014 at the 'start' of 'readsample.fastq', clip set to 7, warning end was not 
reliable
Adapter contaK (GTCCTTTCGTACTAAAATATCATAATTTTTTAAAGATAGAAACCAACCTG): counted 
933 at the 'start' of 'readsample.fastq', clip set to 7
Adapter contaL (CTGGCTTACACCGGTTTGAACTCAGATCATGTAAGAATTTAAAAGTCGAA): counted 
1164 at the 'start' of 'readsample.fastq', clip set to 7
Files: 1
Total reads: 300000
Too short after clip: 6318
Clipped 'start' reads: Count: 2215, Mean: 19.34, Sd: 7.81
Clipped 'end' reads: Count: 899, Mean: 20.02, Sd: 8.77
Trimmed 9816 reads by an average of 8.33 bases on quality < 7

What version of the product are you using? On what operating system?
Last version, on CentOS6.3 64 bit.

Please provide any additional information below.
I am using fastq-mcf as a galaxy plugin. The behaviour is not due to the galaxy 
plugin code, but due to genuine fastq-mcf behaviour apparently. Thanks for your 
consideration!

See http://toolshed.g2.bx.psu.edu/

Original issue reported on code.google.com by joachim....@gmail.com on 11 Feb 2013 at 9:44

GoogleCodeExporter commented 9 years ago
Edit: my version is not the latest, but  ea-utils.1.1.2-484

Original comment by joachim....@gmail.com on 11 Feb 2013 at 9:57

GoogleCodeExporter commented 9 years ago
The program, by default, is looking for lines with adapter strings out of 
300000 samples taken.   contaA,C,D,E,G,I,K,L were all found (counted 640 out of 
.... ).   Others were not found in sufficient numbers.   So they were not 
clipped.

You cold run with -t .00001 ... forcing it to clip if it sees just one example.

Note: This program should only be used to remove "ligated-sequencing adapters" 
- not other forms of sequencing contamination.   In this case only 6-7 bases 
would have to match for the various sequence to be clipped.   In typical 
Illumina runs, only 1-2 adapters are found in high enough numbers to be clipped.

Original comment by earone...@gmail.com on 12 Feb 2013 at 6:31

GoogleCodeExporter commented 9 years ago
Note: If you attach a file containing at least 300k reads to this ticket, I can 
test it out, make sure it's behaving as designed.   Also, I can probably get a 
-t 0 to work (right now I'm not sure what -t 0 will do)

Original comment by earone...@gmail.com on 12 Feb 2013 at 6:33

GoogleCodeExporter commented 9 years ago
Thanks for the explanation. Okay - so basically I am using it for the wrong 
purpose... I search for overrepresented strings in my RNA-seq data: in this 
list nearly always the sequencing primer pops up, next to other strings.

During my current tests, I try to use Fastq-mcf to remove all overrepresented 
strings, not only sequencing adaptor contamination.

Biological contamination, I remove with mapping to the expected sources (rRNA, 
mtDNA, phiX genome). But the 'technical contamination', I currently try to 
remove with fastq-mcf, for adapter removal, a set of polyA sequences (yes, 
which I consider also technical), and other unidentified overrepresented 
strings.

So apparently, fastq-mcf is not designed for what I am trying to do: but since 
polyA/T removal would be a nice feature (and since you already have the 
N-removal feature at the end of reads), perhaps it's worthwhile and rather easy 
to implement?

BTW, I have made a galaxy wrapper around fastq-mcf. With your consent I can 
upload it to the toolshed once testing has ended. See http://galaxyproject.org/

Thanks.

Original comment by joachim....@gmail.com on 13 Feb 2013 at 7:59

GoogleCodeExporter commented 9 years ago
True, polyA/T is easy enough.   I guess a parameter like -P 6, woudl remove 6 
or more A/T's from the ends of reads *after* adapter removal.   In my 
experience, however, polyA tails on RNA are desirable, anchoring assembly and 
alignment, and you're better off designing a a polyA transcriptome to align to 
... the way that rsem-prepare-reference does it.   

http://www.biomedcentral.com/1471-2105/12/323

Original comment by earone...@gmail.com on 13 Feb 2013 at 3:16

GoogleCodeExporter commented 9 years ago
Uploading to galaxy is fine.   If you need me to make changes, let me know.  
Also the new version is streaming (it buffers the 300k sampling reads in RAM... 
so you can use it as a part of a proper pipeline).

Original comment by earone...@gmail.com on 13 Feb 2013 at 3:17

GoogleCodeExporter commented 9 years ago
Thanks for the interesting RSEM article. It has been a long time already on my 
reading list (...), but now I've read it, and indeed contains a lot of valuable 
information. I will definitely compare RSEM to my current workflow.

Indeed, the polyA may be desirable in the case of transcriptome alignment, in 
contrast to genome alignment. So it is up to you whether to include the polyA 
trimming. I can place the functionality of fastq-mcf now more into context. 

Again, thanks for the effort!

Original comment by joachim....@gmail.com on 14 Feb 2013 at 9:33

GoogleCodeExporter commented 9 years ago
Even with genome alignment, programs like tophat and mapsplice deal with poly-A 
tails on their own, rather than relying.  I've done some tests using SEQC data 
and exome/rna seq comparisons and have found that a) RSEM and eXpress are the 
most accurate quantifiers (roughly equivocal) b) aligning to the transcriptome 
is an important first step for any tool (then you can align unaligned reads to 
the genome). and c) bwa is pretty much always better than bowtie for accuracy 
but you have to deal with a lot of its operational shortcomings

Original comment by earone...@gmail.com on 4 Mar 2013 at 2:19

GoogleCodeExporter commented 9 years ago

Original comment by earone...@gmail.com on 4 Mar 2013 at 2:28

GoogleCodeExporter commented 9 years ago
The poly-A feature has been added.   

Original comment by earone...@gmail.com on 24 Apr 2013 at 2:04