flass / cpAAI_Rhizobiaceae

pipeline and reference protein sequence data for generating core-proteome alignment of Rhizobiaceae genomes
GNU General Public License v3.0

(fasta headers) are not orderred the same in input files #1

Closed kuzman1306 closed 3 years ago

kuzman1306 commented 3 years ago

Hi,

with the command:

pipeline/genome2cpAAI.py -q list.txt -p protein_sequences_list -o run_genome2cpAAI --threads 8 --tmp_dir tmp --clean_prevtmp

I got the following error:

no marker gene/protein alignment provided, will have to align marker gene/protein sequences from scratch together with extracted input
cleaning: removing previous temporary files

Building a new DB, current time: 09/08/2021 18:32:33
New DB name:   /mnt/volume/cpAAI_Rhizobiaceae/cpAAI_Rhizobiaceae/data/tmp/blastdb/x.fasta
New DB title:  tmp/blastdb/x.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 4 sequences in 0.0562541 seconds.

Building a new DB, current time: 09/08/2021 18:36:35
New DB name:   /mnt/volume/cpAAI_Rhizobiaceae/cpAAI_Rhizobiaceae/data/tmp/blastdb/y.fasta
New DB title:  tmp/blastdb/y.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 56 sequences in 0.051487 seconds.

Building a new DB, current time: 09/08/2021 18:38:27
New DB name:   /mnt/volume/cpAAI_Rhizobiaceae/cpAAI_Rhizobiaceae/data/tmp/blastdb/z.fasta
New DB title:  tmp/blastdb/z.fasta
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Adding sequences from FASTA; added 22 sequences in 0.0509639 seconds.
aligning extracted protein sequences for marker 36156_rph together with input reference sequences
aligning extracted protein sequences for marker 36158_dITP-XTP_pyrophospha.. together with input reference sequences
aligning extracted protein sequences for marker 36160_dnaA_3 together with input reference sequences
aligning extracted protein sequences for marker 36163_mutM together with input reference sequences
aligning extracted protein sequences for marker 36165_ubiB together with input reference sequences
aligning extracted protein sequences for marker 36169_deoA together with input reference sequences
aligning extracted protein sequences for marker 36173_rfuD_2 together with input reference sequences
aligning extracted protein sequences for marker 36174_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 36198_dapB together with input reference sequences
aligning extracted protein sequences for marker 36440_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 36443_ravA together with input reference sequences
aligning extracted protein sequences for marker 36446_cca together with input reference sequences
aligning extracted protein sequences for marker 36476_prs_2 together with input reference sequences
aligning extracted protein sequences for marker 36480_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 36513_carA together with input reference sequences
aligning extracted protein sequences for marker 36514_yqeY together with input reference sequences
aligning extracted protein sequences for marker 36624_purQ together with input reference sequences
aligning extracted protein sequences for marker 36626_purC_1 together with input reference sequences
aligning extracted protein sequences for marker 36638_rpsD together with input reference sequences
aligning extracted protein sequences for marker 36644_purB together with input reference sequences
aligning extracted protein sequences for marker 36676_cckA together with input reference sequences
aligning extracted protein sequences for marker 36679_recA together with input reference sequences
aligning extracted protein sequences for marker 36692_ppiD together with input reference sequences
aligning extracted protein sequences for marker 36730_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 36733_sufC together with input reference sequences
aligning extracted protein sequences for marker 36734_sufD together with input reference sequences
aligning extracted protein sequences for marker 36735_csd together with input reference sequences
aligning extracted protein sequences for marker 36743_gcvH together with input reference sequences
aligning extracted protein sequences for marker 36775_pcs together with input reference sequences
aligning extracted protein sequences for marker 36778_rfuD_1 together with input reference sequences
aligning extracted protein sequences for marker 36779_Purine-binding_prote.. together with input reference sequences
aligning extracted protein sequences for marker 36790_ppiB together with input reference sequences
aligning extracted protein sequences for marker 36803_recG together with input reference sequences
aligning extracted protein sequences for marker 36808_glmU together with input reference sequences
aligning extracted protein sequences for marker 36812_appA_1 together with input reference sequences
aligning extracted protein sequences for marker 36817_ycfH together with input reference sequences
aligning extracted protein sequences for marker 36890_lexA_1 together with input reference sequences
aligning extracted protein sequences for marker 37004_xerC_2 together with input reference sequences
aligning extracted protein sequences for marker 37007_atpH together with input reference sequences
aligning extracted protein sequences for marker 37009_atpG together with input reference sequences
aligning extracted protein sequences for marker 37011_atpC together with input reference sequences
aligning extracted protein sequences for marker 37022_lpd3 together with input reference sequences
aligning extracted protein sequences for marker 37025_sucB together with input reference sequences
aligning extracted protein sequences for marker 37029_mdh together with input reference sequences
aligning extracted protein sequences for marker 37032_sdhB together with input reference sequences
aligning extracted protein sequences for marker 37038_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 37041_tsaD together with input reference sequences
aligning extracted protein sequences for marker 37070_rlmN together with input reference sequences
aligning extracted protein sequences for marker 37074_thpR together with input reference sequences
aligning extracted protein sequences for marker 37077_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 37111_thiN together with input reference sequences
aligning extracted protein sequences for marker 37123_fzlC together with input reference sequences
aligning extracted protein sequences for marker 37126_lon2 together with input reference sequences
aligning extracted protein sequences for marker 37128_ubiL_2 together with input reference sequences
aligning extracted protein sequences for marker 37155_fdxA together with input reference sequences
aligning extracted protein sequences for marker 37179_proB together with input reference sequences
aligning extracted protein sequences for marker 37184_rpmA together with input reference sequences
aligning extracted protein sequences for marker 37226_pal_1 together with input reference sequences
aligning extracted protein sequences for marker 37252_ompA together with input reference sequences
aligning extracted protein sequences for marker 37254_suhB_2 together with input reference sequences
aligning extracted protein sequences for marker 37259_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 37261_purE together with input reference sequences
aligning extracted protein sequences for marker 37266_ttuE together with input reference sequences
aligning extracted protein sequences for marker 37273_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 37418_eda together with input reference sequences
aligning extracted protein sequences for marker 37428_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 37646_ilvC together with input reference sequences
aligning extracted protein sequences for marker 37654_pdxJ together with input reference sequences
aligning extracted protein sequences for marker 37751_recJ together with input reference sequences
aligning extracted protein sequences for marker 38038_glnE together with input reference sequences
aligning extracted protein sequences for marker 38066_amn together with input reference sequences
aligning extracted protein sequences for marker 38082_smpB together with input reference sequences
aligning extracted protein sequences for marker 38098_ygfZ together with input reference sequences
aligning extracted protein sequences for marker 38105_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 38106_psd together with input reference sequences
aligning extracted protein sequences for marker 38118_yciK together with input reference sequences
aligning extracted protein sequences for marker 38138_fabG_1 together with input reference sequences
aligning extracted protein sequences for marker 38142_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 38144_rsmA together with input reference sequences
aligning extracted protein sequences for marker 38147_lptD together with input reference sequences
aligning extracted protein sequences for marker 38148_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 38152_pepA_1 together with input reference sequences
aligning extracted protein sequences for marker 38153_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 38161_moaD together with input reference sequences
aligning extracted protein sequences for marker 38162_pgsA together with input reference sequences
aligning extracted protein sequences for marker 38167_rlmJ together with input reference sequences
aligning extracted protein sequences for marker 38171_purM together with input reference sequences
aligning extracted protein sequences for marker 38173_dnaA_2 together with input reference sequences
aligning extracted protein sequences for marker 38175_gppA_1 together with input reference sequences
aligning extracted protein sequences for marker 38181_rnd_1 together with input reference sequences
aligning extracted protein sequences for marker 38189_bpt together with input reference sequences
aligning extracted protein sequences for marker 38190_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 38228_argC together with input reference sequences
aligning extracted protein sequences for marker 38230_rpsI together with input reference sequences
aligning extracted protein sequences for marker 38318_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 38332_yeeZ together with input reference sequences
aligning extracted protein sequences for marker 38353_aviRb together with input reference sequences
aligning extracted protein sequences for marker 38357_dnaJ_1 together with input reference sequences
aligning extracted protein sequences for marker 38371_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 38374_murJ together with input reference sequences
aligning extracted protein sequences for marker 38380_mshD_1 together with input reference sequences
aligning extracted protein sequences for marker 38383_miaB together with input reference sequences
aligning extracted protein sequences for marker 38384_PhoH-like_protein together with input reference sequences
aligning extracted protein sequences for marker 38389_fmt_2 together with input reference sequences
aligning extracted protein sequences for marker 38390_truA together with input reference sequences
aligning extracted protein sequences for marker 38406_dapD together with input reference sequences
aligning extracted protein sequences for marker 38428_yidC together with input reference sequences
aligning extracted protein sequences for marker 38431_argF together with input reference sequences
aligning extracted protein sequences for marker 38443_dcd together with input reference sequences
aligning extracted protein sequences for marker 38460_scrK together with input reference sequences
aligning extracted protein sequences for marker 38513_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 38514_hisD together with input reference sequences
aligning extracted protein sequences for marker 38578_puuA_2 together with input reference sequences
aligning extracted protein sequences for marker 38621_purA together with input reference sequences
aligning extracted protein sequences for marker 38725_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 38737_gdh_1 together with input reference sequences
aligning extracted protein sequences for marker 38743_Pyridoxal_phosphate_.. together with input reference sequences
aligning extracted protein sequences for marker 38748_parA together with input reference sequences
aligning extracted protein sequences for marker 38757_aroE together with input reference sequences
aligning extracted protein sequences for marker 38759_dnaQ together with input reference sequences
aligning extracted protein sequences for marker 38762_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 38773_trpA together with input reference sequences
aligning extracted protein sequences for marker 38775_fpgS together with input reference sequences
aligning extracted protein sequences for marker 38782_ahcY together with input reference sequences
aligning extracted protein sequences for marker 38787_baeR together with input reference sequences
aligning extracted protein sequences for marker 38790_coaA_1 together with input reference sequences
aligning extracted protein sequences for marker 38791_hisE together with input reference sequences
aligning extracted protein sequences for marker 38792_hisF together with input reference sequences
aligning extracted protein sequences for marker 38793_hisA together with input reference sequences
aligning extracted protein sequences for marker 38796_hisH together with input reference sequences
aligning extracted protein sequences for marker 38799_hslU together with input reference sequences
aligning extracted protein sequences for marker 38809_hrpB together with input reference sequences
aligning extracted protein sequences for marker 38852_rplC together with input reference sequences
aligning extracted protein sequences for marker 38853_rplD together with input reference sequences
aligning extracted protein sequences for marker 38854_rplW together with input reference sequences
aligning extracted protein sequences for marker 38855_rplB together with input reference sequences
aligning extracted protein sequences for marker 38857_rplV together with input reference sequences
aligning extracted protein sequences for marker 38858_rpsC together with input reference sequences
aligning extracted protein sequences for marker 38859_rplP together with input reference sequences
aligning extracted protein sequences for marker 38861_rpsQ together with input reference sequences
aligning extracted protein sequences for marker 38863_rplX together with input reference sequences
aligning extracted protein sequences for marker 38864_rplE together with input reference sequences
aligning extracted protein sequences for marker 38865_rpsN together with input reference sequences
aligning extracted protein sequences for marker 38866_rpsH together with input reference sequences
aligning extracted protein sequences for marker 38867_rplF together with input reference sequences
aligning extracted protein sequences for marker 38868_rplR together with input reference sequences
aligning extracted protein sequences for marker 38871_rplO together with input reference sequences
aligning extracted protein sequences for marker 38872_secY together with input reference sequences
aligning extracted protein sequences for marker 38876_rpoA together with input reference sequences
aligning extracted protein sequences for marker 38882_alaS_2 together with input reference sequences
aligning extracted protein sequences for marker 38917_grxD together with input reference sequences
aligning extracted protein sequences for marker 38919_putative_protein_RP8.. together with input reference sequences
aligning extracted protein sequences for marker 38933_lpxL together with input reference sequences
aligning extracted protein sequences for marker 39002_ybaB together with input reference sequences
aligning extracted protein sequences for marker 39036_truB together with input reference sequences
aligning extracted protein sequences for marker 39040_xseA together with input reference sequences
aligning extracted protein sequences for marker 39045_upp together with input reference sequences
aligning extracted protein sequences for marker 39066_hypothetical_protein together with input reference sequences
aligning extracted protein sequences for marker 39095_pheT together with input reference sequences
aligning extracted protein sequences for marker 39103_rpsO together with input reference sequences
aligning extracted protein sequences for marker 39227_rsmH together with input reference sequences
aligning extracted protein sequences for marker 39231_murF together with input reference sequences
aligning extracted protein sequences for marker 39232_mraY together with input reference sequences
aligning extracted protein sequences for marker 39233_murD together with input reference sequences
aligning extracted protein sequences for marker 39236_murC together with input reference sequences
aligning extracted protein sequences for marker 39239_ddlB together with input reference sequences
aligning extracted protein sequences for marker 39241_ftsA together with input reference sequences
aligning extracted protein sequences for marker 39242_ftsZ_2 together with input reference sequences
aligning extracted protein sequences for marker 39262_prmC_2 together with input reference sequences
aligning extracted protein sequences for marker 39264_pepQ together with input reference sequences
concatenating the marker protein alignments
Traceback (most recent call last):
  File "pipeline/genome2cpAAI.py", line 270, in <module>
    main(outdir, nflnfmarkgeneseqs=nflnfmarkgeneseqs, nflnfmarkprotseqs=nflnfmarkprotseqs, nflnfquerygenomes=nflnfquerygenomes, nflnfqueryproteomes=nflnfqueryproteomes, nflnfmarkgenealns=nflnfmarkgenealns, nflnfmarkprotalns=nflnfmarkprotalns, tmpdir=tmpdir, nbthreads=nbthreads, aligner=aligner, cleanres=cleanres, cleantmp=cleantmp, cleanaft=cleanaft, reusetmp=reusetmp, verbose=verbose)
  File "pipeline/genome2cpAAI.py", line 199, in main
    nextlabel = iterOneLabel(lfinhandles, foutconcatprotaln, currlabel)
  File "pipeline/genome2cpAAI.py", line 26, in iterOneLabel
    raise IndexError("{}\n{}\nlabels (fasta headers) are not orderred the same in input files".format(line, currlabel))
IndexError: >z translation of NZ_LMVJ01000020.1 Agrobacterium tumefaciens strain NCPPB 3001 A_radiobacter_NCPPB3001_contig7, whole genome shotgun sequence [59231..60635] (reverse complement)

>y
labels (fasta headers) are not orderred the same in input files

Cheers,

Nemanja

flass commented 3 years ago

Hi Nemanja,

Your error arises because the sequences do not appear in the same order in every alignment. They should, since the alignments are generated from files where the sequences appear in the same order, but that relies on a default option of MAFFT, --inputorder. It may be that you have a different version of MAFFT, or that this option is not the default on your platform. You can check with man mafft and look for --inputorder.
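To confirm that mismatched label order really is the cause, the per-marker alignments can be compared directly. The following is only a minimal sketch, not code from the pipeline: the helper names fasta_labels and find_order_mismatches are hypothetical, and it assumes each alignment is a plain FASTA file whose header lines are expected to be identical and in the same order across files.

```python
from typing import Dict, List


def fasta_labels(path: str) -> List[str]:
    """Return the FASTA header lines of a file (without '>'), in file order."""
    with open(path) as handle:
        return [line[1:].strip() for line in handle if line.startswith(">")]


def find_order_mismatches(paths: List[str]) -> Dict[str, List[str]]:
    """Compare every file against the first one.

    Returns a dict mapping each file whose labels differ from the first
    file (in content or in order) to the labels found in that file.
    """
    if not paths:
        return {}
    reference = fasta_labels(paths[0])
    mismatches = {}
    for path in paths[1:]:
        labels = fasta_labels(path)
        if labels != reference:
            mismatches[path] = labels
    return mismatches
```

Calling find_order_mismatches on the list of alignment files in the temporary directory returns an empty dict when all files agree, and otherwise shows which files deviate from the first one.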

To make things more stable in case it is the issue, I updated the code (commit ae04d60) so that this option is invoked explicitly.
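For illustration, invoking the option explicitly from Python could look roughly like the sketch below. This is not the content of the commit: the function names build_mafft_cmd and align are hypothetical, and the sketch assumes that a mafft executable is on the PATH and that it writes the alignment to standard output.

```python
import subprocess
from typing import List


def build_mafft_cmd(infile: str, threads: int = 1) -> List[str]:
    """Build a MAFFT command line that requests input order explicitly."""
    return [
        "mafft",
        "--inputorder",  # keep output sequences in the same order as the input
        "--quiet",       # suppress progress messages on stderr
        "--thread", str(threads),
        infile,
    ]


def align(infile: str, outfile: str, threads: int = 1) -> None:
    """Run MAFFT on infile and write the alignment to outfile."""
    with open(outfile, "w") as out:
        # check=True raises CalledProcessError if MAFFT exits with an error
        subprocess.run(build_mafft_cmd(infile, threads), stdout=out, check=True)
```

Passing --inputorder on every call means the output order no longer depends on which default a given MAFFT build happens to use.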

Can you please try again with the updated code and let me know?

Cheers, Florent

kuzman1306 commented 3 years ago

Hi Florent,

My MAFFT version is 6.240 (--inputorder: "Output order: same as input. Default: on"). Platform: Ubuntu 18.04.4 LTS.

Anyway, I tried the updated code, and it is functioning now!

Thank you very much for your support!

Best regards,

Nemanja

flass commented 3 years ago

Hi Nemanja, great to hear it works! I'll close this for now but please re-open if the issue re-occurs. Best, Florent