DessimozLab / read2tree

a tool for inferring species tree from sequencing reads
MIT License
138 stars 18 forks

BUG?: KeyError(s) when running read2tree #58

Open almosnow opened 4 months ago

almosnow commented 4 months ago

Hello,

I am trying to use read2tree. I was able to install and run it; the example runs without a hitch and I get the expected output files.

When I tried to use my own sequences, though, I was getting many "Invalid marker group" errors. These were relatively straightforward to take care of: I just renamed the fasta header lines accordingly.

Now I cannot get past one error in particular, which reads KeyError: 'U1810'.

I think this definitely has to do with the five-letter code you use/infer, but I don't really know how to make it work properly.

Any ideas?

sinamajidian commented 4 months ago

Hi @almosnow

Thanks for reaching out. Could you please give us more information on how you obtained the gene markers? It would be great if you could share the mplog.log file and the command line(s) you used.

One thing to check: the fasta record IDs of the gene markers must match between the amino acid level (OGXX.fa files) and the nucleotide level. This is needed when you concatenate all fna files into dna_ref.fa and provide it to read2tree with --dna_reference dna_ref.fa. Otherwise, read2tree uses the REST API to download the sequences from the OMA browser, assuming that the gene markers were downloaded from there.
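A quick way to check this before running read2tree is to compare the record IDs on both sides. The following is only a minimal sketch, not part of read2tree: the function names `fasta_ids` and `unmatched_ids` are made up for illustration, and it assumes the marker files end in `.fa`, the record ID is the first word of each header line, and dna_ref.fa is stored outside the marker folder.

```python
from pathlib import Path


def fasta_ids(path):
    """Return the set of record IDs (first word after '>') in a FASTA file."""
    ids = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
    return ids


def unmatched_ids(marker_dir, dna_reference, pattern="*.fa"):
    """Compare IDs in the amino acid marker files against the DNA reference.

    Returns (missing_from_dna, missing_from_markers) as two sorted lists.
    The "*.fa" pattern is an assumption about how the marker files are named.
    """
    aa_ids = set()
    for marker_file in sorted(Path(marker_dir).glob(pattern)):
        aa_ids |= fasta_ids(marker_file)
    nt_ids = fasta_ids(dna_reference)
    return sorted(aa_ids - nt_ids), sorted(nt_ids - aa_ids)
```

Calling `unmatched_ids("marker_genes", "dna_ref.fa")` should return two empty lists when every amino acid record has a nucleotide counterpart with the same ID and vice versa.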

Best, Sina

almosnow commented 4 months ago

Hmm, OK, I see. I think I did not set up the gene markers properly.

Actually, now that I've read more, what I did was completely wrong.

Here's my scenario; perhaps you can advise on what to do.

We have a set of ~15 sequences (coding sequences of the same gene from the same organism, from different samples around the world) with minor variations between them. A phylogeny shows two major groups distinct from each other, but the changes between them are small (SNPs and the like).

We have another set of a few hundred SRA libraries, and we would like to find out which of the aforementioned 15 sequences each one is most similar to.

Is it ok to use those initial 15 sequences as marker genes and try to fit the reads into them?

sinamajidian commented 3 months ago

For this case, read2tree can generate a tree in multiple species mode. However, one gene might not be enough to describe the evolution of the organism or to provide enough resolution to distinguish all samples.

Anyway, you can put the amino acid sequences in a fasta file in the marker_genes folder and the nucleotide sequences of the coding regions (in exactly the same order) in another fasta file, passed with --dna_reference genes.nuc.fa. Note that the gene names must match in both files. Each name starts with a five-letter code for the strain, like this:

>ASTMX02439
>PYGNA12763 
>ELEEL42119
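
If the original headers do not follow this scheme, they could be relabelled with a small script. The sketch below is an illustration only and not part of read2tree: the helper names `make_record_id` and `relabel_fasta` are hypothetical, and the zero-padded five-digit suffix is simply copied from the examples above rather than a documented requirement. Running the same relabelling over the amino acid file and the nucleotide file, with records in the same order, gives matching IDs in both.

```python
def make_record_id(strain_code, index, width=5):
    """Build an ID such as 'STRNA00001' from a five-character strain code.

    The zero-padded numeric suffix is an assumption based on the example IDs.
    """
    if len(strain_code) != 5 or not strain_code.isalnum():
        raise ValueError(f"strain code must be 5 alphanumeric characters: {strain_code!r}")
    return f"{strain_code.upper()}{index:0{width}d}"


def relabel_fasta(lines, strain_codes):
    """Yield FASTA lines with each header replaced by a code-based ID.

    strain_codes maps the original record ID (first word of the header)
    to the five-character code chosen for that strain. Records of the same
    strain are numbered in order of appearance, starting at 1.
    """
    counts = {}
    for line in lines:
        if line.startswith(">"):
            old_id = line[1:].split()[0]
            code = strain_codes[old_id]
            counts[code] = counts.get(code, 0) + 1
            yield ">" + make_record_id(code, counts[code]) + "\n"
        else:
            yield line
```

For example, with `strain_codes = {"sample_1": "STRNA"}`, a record headed `>sample_1 some description` would be rewritten as `>STRNA00001`, and the sequence lines are passed through unchanged.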