flass / cpAAI_Rhizobiaceae

pipeline and reference protein sequence data for generating core-proteome alignment of Rhizobiaceae genomes
GNU General Public License v3.0
0 stars 2 forks source link

labels (fasta headers) are not orderred the same in input files #4

Closed kuzman1306 closed 1 year ago

kuzman1306 commented 2 years ago

Hi Florent,

I used the script for a new dataset (119 querygenomes) but I got the error shown below.

sudo python3.8 pipeline/genome2cpAAI.py -q 1/list.txt -p data/protein_sequences_list -o genome2cpAAI_1_out  --threads 8 --tmp_dir tmp --clean_prevtmp
...
aligning extracted protein sequences for marker 39264_pepQ together with input reference sequences
concatenating the marker protein alignments
Traceback (most recent call last):
  File "pipeline/genome2cpAAI.py", line 273, in <module>
    main(outdir, nflnfmarkgeneseqs=nflnfmarkgeneseqs, nflnfmarkprotseqs=nflnfmar     kprotseqs, nflnfquerygenomes=nflnfquerygenomes, nflnfqueryproteomes=nflnfquerypr     oteomes, nflnfmarkgenealns=nflnfmarkgenealns, nflnfmarkprotalns=nflnfmarkprotaln     s, tmpdir=tmpdir, nbthreads=nbthreads, aligner=aligner, cleanres=cleanres, clean     tmp=cleantmp, cleanaft=cleanaft, reusetmp=reusetmp, verbose=verbose)
  File "pipeline/genome2cpAAI.py", line 202, in main
    nextlabel = iterOneLabel(lfinhandles, foutconcatprotaln, currlabel)
  File "pipeline/genome2cpAAI.py", line 27, in iterOneLabel
    raise IndexError("{}\n{}\nlabels (fasta headers) are not orderred the same i     n input files".format(line, currlabel))
IndexError: >Allorhizobium_oryziradicis_N19 translation of Genus species Unclass     ified. [631069..632905]

>Agrobacterium_rubi_TR3
labels (fasta headers) are not orderred the same in input files

I hope you could check what is causing it.

Thank you in advance.

Cheers,

Nemanja

flass commented 2 years ago

Hi Nemanja,

thank you for posting this issue. I could replicate this bug thanks to your input files.

I think the problem is that you used input genomes that are called the same exact name as organisms already present in the reference set.
I believe this, combined with the dodgy approach I used in the code to indexing sequences based on the fasta sequence header name truncated at the first whitespace, leads to the bug.

So to fix this rapidly, I recommend you rename your input genomic fasta files just so they're not the same name as those we used for the reference set; for instance you could change Allorhizobium_oryziradicis_N19.fasta to Allorhizobium_oryziradicis_strain_N19.fasta.

However, I note that you could also just reduce your input dataset to avoid duplication of strains that are already present in the reference dataset - I think that would save you a lot of computation time!

You being the person having picked the refrence genomes in the original paper, I'm sure you know which ones they are (I expect they're actually the same files you're reusing!) but for the benefit of other users falling potentially in the same trap, I will recommend to have a look at the list of reference genomes to be found through the paper Kuzmanović et al. 2022 and specifically on Figshare there: https://microbiology.figshare.com/articles/dataset/Taxonomy_of_Rhizobiaceae_revisited_proposal_of_a_new_framework_for_genus_delimitation/17076455/1?file=34177917

Finally, I may now do a few changes to the code so that in the future it is robust to that kind of situation.

I hope this helps!

Cheers,

Florent

flass commented 2 years ago

This vulnerability should be addressed in f3b375c. @kuzman1306 you can try either the fixed version of the code (would be grateful for validation) and/or remove redundant sequences in your query genome set. I'll close this for now but please let me know if anything goes wrong. Cheers F

kuzman1306 commented 2 years ago

Hi Florent,

Thank you for your update.

I removed redundant sequences from the list. However, I am getting the following error:

...
concatenating the marker protein alignments
Traceback (most recent call last):
  File "pipeline/genome2cpAAI.py", line 290, in <module>
    main(outdir, nflnfmarkgeneseqs=nflnfmarkgeneseqs,
nflnfmarkprotseqs=nflnfmarkprotseqs, nflnfquerygenomes=nflnfquerygenomes,
nflnfqueryproteomes=nflnfqueryproteomes,
nflnfmarkgenealns=nflnfmarkgenealns, nflnfmarkprotalns=nflnfmarkprotalns,
tmpdir=tmpdir, nbthreads=nbthreads, aligner=aligner, cleanres=cleanres,
cleantmp=cleantmp, cleanaft=cleanaft, reusetmp=reusetmp, verbose=verbose)
  File "pipeline/genome2cpAAI.py", line 219, in main
    nextlabel = iterOneLabel(lfinhandles, foutconcatprotaln, currlabel)
  File "pipeline/genome2cpAAI.py", line 27, in iterOneLabel
    raise IndexError("{}\n{}\nlabels (fasta headers) are different/orderred
differently from in input files".format(line, currlabel))
IndexError: >Rhizobium_croatiense_13T translation of Genus species
Unclassified. [276255..277665] (reverse complement)

>Rhizobium_sp._XXXXXX
labels (fasta headers) are different/orderred differently from in input
files

Regards,

Nemanja

flass commented 2 years ago

OK that's unfortunate. can you please send the new input list file of the genomes that you included (privately by email)?

kuzman1306 commented 2 years ago

Hi Florent,

Thank you for your updates.

I removed redundant sequences, and tested the script by using WGSs, as well as fasta files containing CDSs predicted by prokka. However, I was still getting the similar errors as ones described earlier. I noticed then that one strain does not contain one of the 170 marker genes. It caused the gap in the final alignment after concatenation. I modified protein_sequences_list and protein_alignments_list and excluded this gene and repeated the analysis.

For the CDSs dataset, there were no error messages and I obtained results similar as in our joint paper Kuzmanović et al. 2022 (deviations around 0.1 %). However, for the contigs dataset, I got the following error:

....
concatenating the marker protein alignments
Traceback (most recent call last):
  File "pipeline/genome2cpAAI.py", line 290, in <module>
    main(outdir, nflnfmarkgeneseqs=nflnfmarkgeneseqs, nflnfmarkprotseqs=nflnfmar   kprotseqs, nflnfquerygenomes=nflnfquerygenomes, nflnfqueryproteomes=nflnfquerypr   oteomes, nflnfmarkgenealns=nflnfmarkgenealns, nflnfmarkprotalns=nflnfmarkprotaln   s, tmpdir=tmpdir, nbthreads=nbthreads, aligner=aligner, cleanres=cleanres, clean   tmp=cleantmp, cleanaft=cleanaft, reusetmp=reusetmp, verbose=verbose)
  File "pipeline/genome2cpAAI.py", line 219, in main
    nextlabel = iterOneLabel(lfinhandles, foutconcatprotaln, currlabel)
  File "pipeline/genome2cpAAI.py", line 27, in iterOneLabel
    raise IndexError("{}\n{}\nlabels (fasta headers) are different/orderred diff   erently from in input files".format(line, currlabel))
IndexError: >Rhizobium_freirei_PRF_81 translation of Genus species Unclassified.    [562667..563636]

>Rhizobium_yanglingense_LMG_19592
labels (fasta headers) are different/orderred differently from in input files

Cheers,

Nemanja

flass commented 2 years ago

Dear Nemanja,

thank you for your update, that really helped identifying the bug.

it appears that if there is no tblastn hit to a protein marker in a query genome, no sequence will be added for that genome to the output extracted sequence fasta file (and then the aligned sequence fasta file); as a result, this lead tothe obsevred bug when making the concatenante at the end of the script.

This is now fixed with d89c38e. I tested this with your dataset, using either just the protein marker sequence as input (option -p) or also with providing the aligned protein marker sequences (options -p and -A), and it works.

I can send you the concatenated alignments if you want to evaluate the cpAAI values, or you can re-run it on your side.

I'll close this for now but please let me know if it still does not work. Cheers! Florent

kuzman1306 commented 1 year ago

Hi Florent,

Thank you for modifying the script. The pipeline finalized the run (options -p and -A) without the error. However, when I checked the alignment, the most of the aa places were missing for strains included into analysis. Was it also the case for your run?

Regards,

Nemanja

flass commented 1 year ago

Dear Nemanja,

sorry for the delay, as I was a bit quick to delete all the results from the runs, I had to re-run the program to reproduce the problem. indeed I see the same probblem.... puzzling... I will investigate it.

flass commented 1 year ago

Hi Nemanja,

i think I found the bug! turns out it's in your data, not in my code ;-)

it's because input contigs all carry the same id as specified by the headers of sequences in the contig fasta files:

>Genus species Unclassified.

(in practice, this means that all sequences have Genus as their id as fasta headers are trimmed at the first whitespace by Blast and BioPython other applications)

this duplication of ids means that incorrect source sequence data is used when trying to extract gene and then protein sequences, and often the coordiantes do not exist in the template sequnce so an empty sequence is returned, and that's why we end up with close to nothing (or sometimes non-homologous sequence) in the marker concatenate.

please try renaming the fasta headers in your input and I think that should solve the issue

kuzman1306 commented 1 year ago

Hi Florent,

I finally tested the script. I changed fasta header names, and it works now. Thank you for your efforts on solving this issue.

Cheers,

Nemanja