Zymo-Research / figaro

An efficient and objective tool for optimizing microbiome rRNA gene trimming parameters
GNU General Public License v3.0

Errors using figaro with v3-v4 sequencing file MiSeq #37

Open vehamel opened 3 years ago

vehamel commented 3 years ago

Hi!

This is the first time I am using figaro. I am not a regular Python user, which is why I am asking for help. I don't know what to do with this output or how to interpret it. I was using figaro to help me choose how to trim my sequences because I find the quality poor.

Thanks a lot for your help!

Here is what I got when I ran it:

Forward read files appear to be of different lengths or of varied lengths.
{(300, 0.7550505050505051), (299, 0.945050505050505), (300, 0.6905050505050505), (299, 0.8581818181818182), (300, 0.8383838383838383), (299, 0.9797979797979798), (299, 1.3761616161616161), (299, 0.854949494949495), (299, 0.9090909090909091), (299, 1.9526262626262625), (299, 1.0460606060606061), (299, 0.9923232323232323), (300, 0.3405050505050505), (299, 1.9267676767676767), (299, 0.7550505050505051), (299, 1.0233333333333334), (300, 1.5252525252525253), (299, 0.831919191919192), (300, 0.8145454545454546), (299, 0.7963636363636364), (299, 0.8504040404040404), (300, 0.6540404040404041), (300, 0.7146464646464646), (299, 0.9465656565656566), (300, 0.797979797979798), (299, 1.0925252525252525), (300, 0.6440404040404041), (300, 1.4343434343434343), (300, 0.7337373737373738), (299, 0.8819191919191919), (299, 0.753939393939394), (299, 0.9716161616161616), (299, 1.2044444444444444), (299, 1.0953535353535353), (299, 0.8723232323232324), (299, 0.8145454545454546), (299, 0.8686868686868686), (300, 0.6944444444444444), (299, 0.9166666666666666), (299, 1.8226262626262626), (300, 0.7167676767676767), (299, 0.9837373737373737), (299, 1.1268686868686868), (299, 1.0920202020202021), (300, 0.7135353535353536), (299, 0.7975757575757575), (299, 1.7006060606060607)}
Reverse read files appear to be of different lengths or of varied lengths.
{(300, 0.6524242424242425), (300, 0.7228282828282828), (300, 0.34454545454545454), (300, 0.2771717171717172), (300, 0.5175757575757576), (300, 0.805959595959596), (300, 0.39555555555555555), (300, 0.5716161616161616), (300, 0.5268686868686868), (300, 0.19989898989898988), (300, 0.2832323232323232), (300, 0.5268686868686869), (300, 0.34383838383838383), (300, 0.21575757575757576), (300, 0.5425252525252525), (300, 0.8117171717171717), (300, 0.7348484848484849), (300, 0.4302020202020202), (300, 0.4011111111111111), (300, 0.49737373737373736), (299, 0.8988888888888888), (300, 0.4908080808080808), (300, 0.7632323232323233), (300, 0.9586868686868687), (300, 0.5066666666666667), (300, 0.6475757575757576), (300, 0.16353535353535353), (300, 0.45202020202020204), (300, 0.6666666666666667), (300, 0.612020202020202), (300, 0.3106060606060606), (300, 0.9995959595959596), (300, 0.5732323232323232), (300, 0.7272727272727273), (300, 0.6565656565656566), (300, 0.553030303030303), (300, 0.4670707070707071), (300, 0.38383838383838387), (300, 0.8771717171717172), (300, 0.547070707070707), (300, 0.8484848484848485), (299, 0.9389898989898989), (300, 0.5276767676767676), (300, 0.7070707070707071), (300, 0.7485858585858586)}
Forward reads appear to not be of consistent length.
{(300, 0.7550505050505051), (299, 0.945050505050505), (300, 0.6905050505050505), (299, 0.8581818181818182), (300, 0.8383838383838383), (299, 0.9797979797979798), (299, 1.3761616161616161), (299, 0.854949494949495), (299, 0.9090909090909091), (299, 1.9526262626262625), (299, 1.0460606060606061), (299, 0.9923232323232323), (300, 0.3405050505050505), (299, 1.9267676767676767), (299, 0.7550505050505051), (299, 1.0233333333333334), (300, 1.5252525252525253), (299, 0.831919191919192), (300, 0.8145454545454546), (299, 0.7963636363636364), (299, 0.8504040404040404), (300, 0.6540404040404041), (300, 0.7146464646464646), (299, 0.9465656565656566), (300, 0.797979797979798), (299, 1.0925252525252525), (300, 0.6440404040404041), (300, 1.4343434343434343), (300, 0.7337373737373738), (299, 0.8819191919191919), (299, 0.753939393939394), (299, 0.9716161616161616), (299, 1.2044444444444444), (299, 1.0953535353535353), (299, 0.8723232323232324), (299, 0.8145454545454546), (299, 0.8686868686868686), (300, 0.6944444444444444), (299, 0.9166666666666666), (299, 1.8226262626262626), (300, 0.7167676767676767), (299, 0.9837373737373737), (299, 1.1268686868686868), (299, 1.0920202020202021), (300, 0.7135353535353536), (299, 0.7975757575757575), (299, 1.7006060606060607)}
Reverse reads appear to not be of consistent length.
{(300, 0.6524242424242425), (300, 0.7228282828282828), (300, 0.34454545454545454), (300, 0.2771717171717172), (300, 0.5175757575757576), (300, 0.805959595959596), (300, 0.39555555555555555), (300, 0.5716161616161616), (300, 0.5268686868686868), (300, 0.19989898989898988), (300, 0.2832323232323232), (300, 0.5268686868686869), (300, 0.34383838383838383), (300, 0.21575757575757576), (300, 0.5425252525252525), (300, 0.8117171717171717), (300, 0.7348484848484849), (300, 0.4302020202020202), (300, 0.4011111111111111), (300, 0.49737373737373736), (299, 0.8988888888888888), (300, 0.4908080808080808), (300, 0.7632323232323233), (300, 0.9586868686868687), (300, 0.5066666666666667), (300, 0.6475757575757576), (300, 0.16353535353535353), (300, 0.45202020202020204), (300, 0.6666666666666667), (300, 0.612020202020202), (300, 0.3106060606060606), (300, 0.9995959595959596), (300, 0.5732323232323232), (300, 0.7272727272727273), (300, 0.6565656565656566), (300, 0.553030303030303), (300, 0.4670707070707071), (300, 0.38383838383838387), (300, 0.8771717171717172), (300, 0.547070707070707), (300, 0.8484848484848485), (299, 0.9389898989898989), (300, 0.5276767676767676), (300, 0.7070707070707071), (300, 0.7485858585858586)}
Traceback (most recent call last):
  File "C:\Users\veham18\figaro\figaro\figaro.py", line 218, in <module>
    main()
  File "C:\Users\veham18\figaro\figaro\figaro.py", line 210, in main
    resultTable, forwardCurve, reverseCurve = trimParameterPrediction.performAnalysisLite(parameters.inputDirectory.value, parameters.minimumCombinedReadLength.value, subsample = parameters.subsample.value, percentile = parameters.percentile.value, forwardPrimerLength=parameters.forwardPrimerLength.value, reversePrimerLength=parameters.reversePrimerLength.value, namingStandardAlias=fileNamingStandard)
  File "C:\Users\veham18\figaro\figaro\trimParameterPrediction.py", line 448, in performAnalysisLite
    forwardReadLength, reverseReadLength = checkReadLengths(fastqList)
  File "C:\Users\veham18\figaro\figaro\trimParameterPrediction.py", line 407, in checkReadLengths
    raise fastqHandler.FastqValidationError("Unable to validate fastq files enough to perform this operation. Please check log for specific error(s).")
fastqHandler.FastqValidationError: Unable to validate fastq files enough to perform this operation. Please check log for specific error(s).

michael-weinstein commented 3 years ago

This error usually happens because of reads that were pre-trimmed and of varying length. Do you know if that was the case here?

vehamel commented 3 years ago

Hi!

No. I tried trimming them, but I gave figaro the original files, so they were not trimmed. But yes, they do seem to be of varying length (299 or 300), which I think is to be expected, no? A one-nucleotide difference is not a big difference... What can I do about that?
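One way to see how the read lengths are actually distributed before deciding what to do is to tally them per file. The following is a minimal sketch, not part of FIGARO: the function names are hypothetical, and it assumes the common 4-line FASTQ layout (header, sequence, "+", qualities) with files that are either plain text or gzip-compressed.

```python
import gzip
import sys
from collections import Counter
from itertools import islice


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path, "rt")


def read_length_counts(path):
    """Return a Counter mapping read length -> number of reads in one file.

    Assumption: every record spans exactly 4 lines, so the sequence is
    line 2 of each record (index 1, then every 4th line after that).
    """
    with open_fastq(path) as handle:
        return Counter(len(line.rstrip("\r\n")) for line in islice(handle, 1, None, 4))


if __name__ == "__main__":
    # Usage: python read_lengths.py sample_R1.fastq.gz sample_R2.fastq.gz ...
    for fastq_path in sys.argv[1:]:
        counts = read_length_counts(fastq_path)
        print(fastq_path, dict(sorted(counts.items())))
```

A file that prints more than one length here is likely to trigger the "not of consistent length" message shown above.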

janetw commented 3 years ago

Hello, I too am trying to use figaro for the first time. I have now been able to get it to run, but I am getting a similar output. These reads were already trimmed of primers and barcodes. Since we used phasing in our primers, I am not surprised that I have varied lengths of forward and reverse reads. Does figaro require reads to be of the same length?

vehamel commented 3 years ago

Hello!

Me too! I will need to remove the first part of the sequences because they must be primers. I forgot to do it, and now I am thinking of adding that to my script!

vehamel commented 3 years ago

Hi!

I still cannot use the tool! Can you help me?

janetw commented 3 years ago

Hello, I was able to get FIGARO to work by first running FastQC and MultiQC to determine the length that I wanted to trim to. I then used Trimmomatic to make all the reads the same length. Trimmomatic has the option to crop at a certain length and drop reads that are shorter, or you can choose to crop at the shortest sequencing read length; that's what I did. I then ran FIGARO on the trimmed reads, and once the reads were all a consistent length, it ran fine. Hope this is helpful.

vehamel commented 3 years ago

Hello!

I understand! But isn't the goal of using FIGARO to optimize where we should trim our sequences? Maybe I don't understand correctly?!

janetw commented 3 years ago

Hello, FIGARO helps to choose parameters for the filterAndTrim function in DADA2. For FIGARO to work, however, the reads going into it must be one consistent length. So for example, I had reads that ranged from 269-281 bases. I cropped all reads to 269 and then used those trimmed reads in FIGARO. The output of FIGARO then provided what it determined to be optimal settings for the truncLen and maxEE settings in DADA2. I still am hoping that eventually FIGARO will be able to handle varying lengths.
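The cropping step described here can be illustrated in plain Python. This is only a sketch of the idea, not the Trimmomatic implementation and not part of FIGARO: `crop_fastq` is a hypothetical helper that assumes 4-line FASTQ records, truncates every read to `length` bases from the 3' end, and drops reads shorter than that.

```python
import gzip


def open_fastq(path, mode="rt"):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, mode) if str(path).endswith(".gz") else open(path, mode)


def crop_fastq(in_path, out_path, length):
    """Write reads cropped to exactly `length` bases; return (kept, dropped).

    Assumption: records are 4 lines each. Reads shorter than `length`
    are dropped so that every read in the output has the same length.
    """
    kept = dropped = 0
    with open_fastq(in_path) as src, open_fastq(out_path, "wt") as dst:
        while True:
            header = src.readline()
            if not header.strip():  # end of file (or trailing blank line)
                break
            seq = src.readline().rstrip("\r\n")
            src.readline()  # skip the '+' separator line
            qual = src.readline().rstrip("\r\n")
            if len(seq) < length:
                dropped += 1
                continue
            # Crop sequence and qualities together so they stay aligned.
            dst.write(header.rstrip("\r\n") + "\n")
            dst.write(seq[:length] + "\n+\n")
            dst.write(qual[:length] + "\n")
            kept += 1
    return kept, dropped
```

Calling `crop_fastq("in.fastq", "out.fastq", 269)` with the shortest observed length, as in the 269-281 example, keeps every read. Choosing a longer value drops reads from that file only, which matters if the forward and reverse files must stay in the same order for later steps.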

vehamel commented 3 years ago

Thanks a lot for the explanation! I will try that ;)

michael-weinstein commented 3 years ago

Thanks for the community support. Sorry for being away for a bit, a new baby over the last few weeks has been keeping me occupied. I agree very much with the approach above: if your reads differ in length only slightly (a few bases here and there), just pretrim them to the shortest length, since you don't want to be selecting trimming parameters in the region where trimming may already have happened to some reads. If your reads differ in length by a lot due to quality trimming, I recommend not doing that quality trim, as the purpose of FIGARO is to optimize the DADA2 native quality trimming methods. In the case shown above, where it looks like the reads vary between 299 and 300 in length, just trim everything down to 297 or even 295 to be safe. It's unlikely you'd want to retain those last few bases anyway.

janetw commented 3 years ago

Well, now I am wondering, don't you have to trim to a consistent length in order to use FIGARO? Or is that now not the case? Thanks!

michael-weinstein commented 3 years ago

All the reads in a given direction should be the same length going into FIGARO. Forward and reverse reads don’t need to be the same length, but all the forward and all the reverse reads should be the same length.
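This rule can be checked per direction before running FIGARO. The sketch below is an illustration under stated assumptions rather than FIGARO's own validation: the helper names are hypothetical, records are assumed to be 4 lines each, and forward and reverse files are assumed to carry `_R1` and `_R2` in their names (Illumina-style naming), which may not match your files.

```python
import gzip
import os
import re
from itertools import islice


def sequence_lengths(path):
    """Return the set of distinct read lengths in one FASTQ file."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # Sequence lines are line 2 of every 4-line record.
        return {len(line.rstrip("\r\n")) for line in islice(handle, 1, None, 4)}


def lengths_by_direction(paths):
    """Collect distinct read lengths separately for forward and reverse files.

    Assumption: '_R1' marks forward and '_R2' marks reverse in the file name.
    """
    lengths = {"forward": set(), "reverse": set()}
    for path in paths:
        name = os.path.basename(str(path))
        if re.search(r"_R1(?=[_.])", name):
            lengths["forward"] |= sequence_lengths(path)
        elif re.search(r"_R2(?=[_.])", name):
            lengths["reverse"] |= sequence_lengths(path)
        else:
            raise ValueError(f"Cannot tell read direction from file name: {name}")
    return lengths


def is_consistent(lengths):
    """A direction is consistent when it has at most one distinct length."""
    return {direction: len(found) <= 1 for direction, found in lengths.items()}
```

Forward reads of 300 and reverse reads of 294 would both pass this check, because each direction is evaluated on its own.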

janetw commented 3 years ago

Thanks!

BrendaAmairanibp commented 2 years ago

Hi Janetw, reading your comments really helped me get through my Illumina v3-v4 data, but I have some trouble and doubts about trimming my sequences to the same length. Since there are no adapters in my fastq files, I suppose I only have to use the "CROP" command in Trimmomatic. Is this correct? Hoping you can help me.

handibles commented 2 years ago

Brenda, yup, passing CROP:220 to Trimmomatic will cut reads from the 3' end down to 220 bp. For trimming the 5' end, see HEADCROP in the Trimmomatic reference manual.

cutadapt will also trim 3' bases from reads to a fixed length, e.g. -l 220 or --length 220 for 220bp length.

@michael-weinstein congrats on the sprog! :smiley_cat:

edit: the cutadapt option --minimum-length / -m will remove reads shorter than the value specified. I have resorted to passing both -l 220 and -m 220 to strictly enforce all reads being one length, as read lengths vary slightly at the best of times. Use another run of FastQC/MultiQC to check your lengths and outputs.
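For scripting this, the two options can be combined into one cutadapt call per file. The sketch below only builds the argument list; the function name and file names are hypothetical, and it assumes cutadapt is installed and run in single-end mode on one file at a time, using the `--length`, `--minimum-length` and `-o` options mentioned above.

```python
import shlex


def cutadapt_fixed_length_args(in_path, out_path, length):
    """Build a cutadapt command that forces every read to one length.

    --length crops longer reads at the 3' end and --minimum-length
    discards reads that are shorter, so only reads of exactly `length`
    bases remain in the output.
    """
    length = str(int(length))
    return [
        "cutadapt",
        "--length", length,
        "--minimum-length", length,
        "-o", str(out_path),
        str(in_path),
    ]


if __name__ == "__main__":
    args = cutadapt_fixed_length_args("R2.fastq.gz", "R2.len220.fastq.gz", 220)
    # Print the command for inspection; to actually run it, pass `args`
    # to subprocess.run(args, check=True).
    print(shlex.join(args))
```

Because `--minimum-length` discards reads, running this on only one file of a pair can leave the forward and reverse files with different read counts; picking a length no longer than the shortest read avoids dropping anything.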

Remember, and as above, FIGARO doesn't need F and R reads to be the same length, so QC / trim them separately if it helps you retain more of your sequence (e.g. F is uniform 300 but R is 294-300 - do R only).