Bad File Format - Githubissues

chaoszhang / ASTER

Accurate Species Tree EstimatoR series: a family of optimization algorithms for species tree inference implemented in C++ (including ASTRAL-Pro & Weighted ASTRAL)

GNU Affero General Public License v3.0


Bad File Format #28

Open dylanHco opened 2 weeks ago

dylanHco commented 2 weeks ago

Hello, I am trying to use Waster to generate a tree for input into the Cactus aligner. I am trying to align chloroplast genomes from several different species. Instead of using the raw FASTQ files, I first run them through GetOrganelle (https://github.com/Kinggerm/GetOrganelle) to extract only the reads associated with the chloroplast, and build the tree from those. GetOrganelle outputs the chloroplast-associated paired-end reads as FASTQ files, and I use BBMerge to merge each pair of files. However, when I try to use the merged files with Waster, I get "bad format" errors, and I am not sure why. Below are the first few lines of a merged FASTA file:

LH00305:47:22J3KWLT3:1:1106:31984:20819 1:N:0:GAATTAGT+GTCAGTAC
TTTCGCTACAGCACCCGCTGCTCTAGCTAATTGTCCACCCTTTCCAAGTGTGATTTCTATGTTATGTATG
GCCGTGCCTAAGGGCATATCGGTTGAAGTAGATTC
LH00305:47:22J3KWLT3:1:1107:33731:12248 1:N:0:GAATTAGT+GTCAGTAC
ATTCTATCTAACGAATGAGACTTCTATGGATCTATCCCATTTTTTCGGGTTATCCAAAAGAGTTTAATTA
TTAATTACATGAGTTTCAAACTTGAATTTGGATTCCTAAT
LH00305:47:22J3KWLT3:1:1109:18790:4269 1:N:0:GAATTAGT+GTCAGTAC
TTTAATTATTATTTTTGATATTTTATTTTGTAGGATAGAGTCAAAACTTATCCTAAGTTCCCCAAATTAG
ACCAACGGAATTCTGTTTGCTATATTATATAAAAAAGTGCTTCTGAATTAATCTCATCTT
LH00305:47:22J3KWLT3:1:1109:44115:6367 1:N:0:GAATTCGT+GTCAGTAC
GAGTTCATTCTCCGGGAAACTCCGTTTAAATTATTCCGGTGGATTCTTTACAACCTACTTCTTTTATTAT
CTCATTGGAAATCATATAAAGACAATTCCTATTTAATATAGCTAT
LH00305:47:22J3KWLT3:1:1109:37125:16301 1:N:0:GAATTAGT+GTCAGTAC
CGCACTTCTAACACTTGTTCCACTTTTGGAAGACCCTGCGTTATGTCCCCAGATCTCGACTTTTCATATA
TAAATGTAACTAATGTATCTCCTTCATAAAGGATTTCCCCATAATGGCCATGAACGGTTGCTCCTGGGGT
GGCCAAATAAGGCCTAGCTGATCGTATTACTACGGAATCGACTTGAACAAATATAA
LH00305:47:22J3KWLT3:1:1109:20361:19265 1:N:0:GAATTAGT+GTCAGTAC
ATATAATCCCATAGACCTCCTTTAAGAATTCCAATCTGGAAAAAGAATTGATAGCTTGTATTTCGGTTGT
ATCAATTATCATTTTTAACGATCAACTTCTCCCATAATGATATCTATGCTACCTAATATCGTCATAATAT
CAGCCAATTTCATTCTTTTAA
LH00305:47:22J3KWLT3:1:1109:43708:28927 1:N:0:GAATTAGT+GTCAGTAC
TTTTCTTCTTCCATATGTAAAAAGGGAATAAACAAATCAATCAAATTCCGGGATGCTTCATGAAGTGCTT
CTTTCGGAGTTAAACTTCCGTTTGTCCATATTTCTAGAAAAAGTATCTCTTGTTTTTCATTCCCATTCCC
ATAAGAAAGAATACTATGATTTGCATTTCGAACAGGCA

Thanks for any suggestions! Dylan
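A quick way to rule out a malformed sequence file is to look at its first non-blank line: FASTA records conventionally start with `>` and FASTQ records with `@`. The sketch below does only that. The function name `sniff_sequence_format` is hypothetical, and the check is not a reimplementation of whatever waster tests internally before it prints "bad format". Note that the excerpt above shows header lines without a leading `>`, which may simply be how the page rendered the paste.

```python
import gzip
from pathlib import Path


def sniff_sequence_format(path):
    """Guess a sequence file's format from its first non-blank line.

    Returns 'fasta', 'fastq', 'unknown', 'empty', or 'missing'.
    This is a convention-based check, not waster's own validation.
    """
    p = Path(path)
    if not p.is_file():
        return "missing"
    # Assumption: a '.gz' suffix means the file is gzip-compressed.
    opener = gzip.open if p.suffix == ".gz" else open
    with opener(p, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            return "unknown"
    return "empty"
```

A result of `missing` or `empty` points at a path problem rather than a formatting problem inside the file.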

chaoszhang commented 2 weeks ago

Please send me the command, direct input file (this should be a list of fasta files), and the log. This will help me diagnose the problem. Thanks.

dylanHco commented 1 week ago

Here is the command: /projects/p31913/ASTER/bin/waster-site -i in3 -u 1 -t 4 -k 8 -o guidetest1.tre 2>a2.log

Log output:

Without-Alignment/Assembly Species Tree EstimatoR † (site)
Version: v1.16.1.0
Make sure you have run 'waster-site -h', read about '-k' command, and ensured you have enough memory to proceed!
Quality control: Masking all SNP bases with quality lower than '?' for FASTQ inputs.
Quality control: Masking all non-SNP bases with quality lower than '5' for FASTQ inputs.
Species A_longiflora_ENG_S44 is selected to count the most frequent patterns.
Hash table 0% filled.
Species /projects/p31913/Trim_outs/A_palmeri_1_S48/*.fq is selected to count the most frequent patterns.
File A_palmeri_5_S43 bad format!

chaoszhang commented 1 week ago

Currently waster does not support wildcards such as *.fq in the input list. If you have multiple files for the same sample, please cat them into one file.
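For anyone who prefers to script the concatenation rather than run `cat` by hand, here is a minimal sketch that expands a wildcard itself and writes one combined file per sample. The helper name `concatenate_files` and the paths in the usage note are hypothetical; the files are copied byte for byte, exactly as `cat` would do.

```python
import glob
import shutil


def concatenate_files(pattern, output_path):
    """Concatenate every file matching `pattern` (sorted by name) into `output_path`.

    Returns the list of input files that were combined.
    Choose an `output_path` that does not itself match `pattern`,
    otherwise a second run would append the old output to the new one.
    """
    inputs = sorted(glob.glob(pattern))
    if not inputs:
        raise FileNotFoundError(f"no files match {pattern!r}")
    with open(output_path, "wb") as out:
        for name in inputs:
            with open(name, "rb") as src:
                shutil.copyfileobj(src, out)  # byte-for-byte copy, like `cat`
    return inputs
```

Usage might look like `concatenate_files("sample_dir/*.fq", "sample_dir/sample.combined.fastq")`, after which the combined file, not the wildcard, goes into the input list.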

dylanHco commented 1 week ago

I have tried that too - and I get the same error.

chaoszhang commented 1 week ago

Can I see the input and log file?

dylanHco commented 1 week ago

Without-Alignment/Assembly Species Tree EstimatoR † (site)
Version: v1.16.1.0
Make sure you have run 'waster-site -h', read about '-k' command, and ensured you have enough memory to proceed!
Quality control: Masking all SNP bases with quality lower than '?' for FASTQ inputs.
Quality control: Masking all non-SNP bases with quality lower than '5' for FASTQ inputs.
Species /projects/p31913/Trim_outs/A_tabernaemontana_repens_S31/A_tabernaemontana_repens_S31.merged.fasta is selected to count the most frequent patterns.
File A_tharpii_P33_S26 bad format!

chaoszhang commented 1 week ago

I see. This may be counterintuitive, but try the following format in your input file instead:

/projects/p31913/Trim_outs/A_arenaria_TX7_S43/A_arenaria_TX7_S43.merged.fasta A_arenaria_TX7_S43
/projects/p31913/Trim_outs/A_ciliata_texanaH17_S47/A_ciliata_texanaH17_S47.merged.fasta A_ciliata_texanaH17_S47
......

dylanHco commented 1 week ago

I still get the same error.

Without-Alignment/Assembly Species Tree EstimatoR † (site)
Version: v1.16.1.0
Make sure you have run 'waster-site -h', read about '-k' command, and ensured you have enough memory to proceed!
Quality control: Masking all SNP bases with quality lower than '?' for FASTQ inputs.
Quality control: Masking all non-SNP bases with quality lower than '5' for FASTQ inputs.
Species A_rigida_H10_S46 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_tomentosa_tomentosa_4_S29 is selected to count the most frequent patterns.
Hash table 0% filled.
Species /projects/p31913/Trim_outs/A_grandiflora_1_S32/A_grandiflora_1_S32.merged.fasta is selected to count the most frequent patterns.
File A_grandiflora_1_S32 bad format!

chaoszhang commented 1 week ago

inputfileA.txt Try this input.

dylanHco commented 1 week ago

Without-Alignment/Assembly Species Tree EstimatoR † (site)
Version: v1.16.1.0
Make sure you have run 'waster-site -h', read about '-k' command, and ensured you have enough memory to proceed!
Quality control: Masking all SNP bases with quality lower than '?' for FASTQ inputs.
Quality control: Masking all non-SNP bases with quality lower than '5' for FASTQ inputs.
Species /projects/p31913/Trim_outs/A_tabernaemontana_repens_S31/A_tabernaemontana_repens_S31.merged.fasta is selected to count the most frequent patterns.
File A_tabernaemontanaB_repens_S31 bad format!

chaoszhang commented 1 week ago

inputfileA.txt Weird. What about this one?

dylanHco commented 1 week ago

Without-Alignment/Assembly Species Tree EstimatoR † (site)
Version: v1.16.1.0
Make sure you have run 'waster-site -h', read about '-k' command, and ensured you have enough memory to proceed!
Quality control: Masking all SNP bases with quality lower than '?' for FASTQ inputs.
Quality control: Masking all non-SNP bases with quality lower than '5' for FASTQ inputs.
Species A_tabernaemontanaB_repens_S31 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_tabernaemontanaA_H9_S39 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_rigida_H10_S46 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_ciliata_texanaH17_S47 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_hubrichtii_2_S30 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_tomentosa_tomentosa_4_S29 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_mystery_2_S42 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_palmeriB_S48 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_tharpii_P33_S26 is selected to count the most frequent patterns.
Hash table 0% filled.
Species Rhazya_stricta_S10 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_longiflora_OV_S11 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_ciliata_texana_S6 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_grandiflora_2_S36 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_fugatei_F25_S32 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_rigida_3_S15 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_longiflora_ENG_S44 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_arenaria_TX7_S43 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_fugatei_2_S11 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_kearnyana_2CKE_S18 is selected to count the most frequent patterns.
Hash table 0% filled.
Species A_palmeriA_S4 is selected to count the most frequent patterns.
File /projects/p31913/Trim_outs/A_palmeri_1_S4/A_palmeri_1_S4.merged.fasta bad format!

chaoszhang commented 1 week ago

Some progress. Please send me /projects/p31913/Trim_outs/A_palmeri_1_S4/A_palmeri_1_S4.merged.fasta if that is small enough. If it is very large, send me the first 10,000 lines.

dylanHco commented 1 week ago

OK, so the previous file you sent me works. The merged.fasta was missing from that folder, and the program started to work once that was fixed. No other merged FASTA files were missing, so however you formatted the input file is what got it to work.
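Since the last failure came down to a file that did not exist, it can save a round trip to check the input list before launching a run. The sketch below is one way to do that under stated assumptions: it treats any whitespace-separated token containing `/` as a file path and everything else as a sample name, which matches the absolute paths used in this thread but is not taken from the waster documentation. The function name `find_problem_paths` is hypothetical.

```python
import os


def find_problem_paths(list_file):
    """Report suspicious file paths in an input list.

    Assumption: tokens containing '/' are file paths; other tokens are
    sample names and are ignored. Returns (line_number, path, reason)
    tuples, where reason is 'wildcard', 'missing', or 'empty'.
    """
    problems = []
    with open(list_file) as handle:
        for line_no, line in enumerate(handle, start=1):
            for token in line.split():
                if "/" not in token:
                    continue  # assumed to be a sample name
                if any(ch in token for ch in "*?["):
                    problems.append((line_no, token, "wildcard"))
                elif not os.path.isfile(token):
                    problems.append((line_no, token, "missing"))
                elif os.path.getsize(token) == 0:
                    problems.append((line_no, token, "empty"))
    return problems
```

An empty result means every listed path exists and is non-empty. It says nothing about whether the contents of each file are well formed.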