DessimozLab / read2tree

a tool for inferring species tree from sequencing reads
MIT License
142 stars 18 forks source link

OGs did not have any DNA #36

Open Alfred-RRu opened 1 year ago

Alfred-RRu commented 1 year ago

Hello,

Thank you first for your interesting work.

I'm trying to apply your method to some analyses over corals. OMA only contain few distant species of cnidarians, so I followed the steps in "obtaining marker genes for viral dataset", obtaining the CDS and cDNA of three species related to the ones I'm studiying.

After obtaining 16135 OGS, OMA was run. Nevethelss, when this OGS are compared with my sequences the message "This OG did not have any DNA" for each OG.

After the analysis, the next lines are showed: Traceback (most recent call last): File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, *kwds)) File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar return list(map(args)) File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/Aligner.py", line 292, in _align_worker align.dna = self._get_translated_alignment(codons, alignment) File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/Aligner.py", line 216, in _get_translated_alignment codon = codons[rec.id] KeyError: '04095' """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/home/alfred/miniconda3/envs/r2t/bin/read2tree", line 4, in import('pkg_resources').run_script('read2tree==0.1.5', 'read2tree') File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/pkg_resources/init.py", line 720, in run_script self.require(requires)[0].run_script(script_name, ns) File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/pkg_resources/init.py", line 1570, in run_script exec(script_code, namespace, namespace) File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/EGG-INFO/scripts/read2tree", line 16, in File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/main.py", line 291, in main File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/Aligner.py", line 51, in init File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/Aligner.py", line 330, in _align File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 367, in map return self._map_async(func, iterable, mapstar, chunksize).get() File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 774, in get raise self._value KeyError: '04095' (base) alfred@adn10:/media/Data/Alfred/Read2tree$ read2tree --standalone_path /media/Data/Alfred/Read2tree/myWorkingDir/Output/OrthologousGroup Traceback (most recent call last): File "/home/alfred/miniconda3/envs/r2t/bin/read2tree", line 4, in import('pkg_resources').run_script('read2tree==0.1.5', 'read2tree') File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/pkg_resources/init.py", line 720, in run_script self.require(requires)[0].run_script(script_name, ns) File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/pkg_resources/init.py", line 1570, in run_script exec(script_code, namespace, namespace) File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/EGG-INFO/scripts/read2tree", line 16, in File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/main.py", line 287, in main File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/parser/OMAOutputParser.py", line 22, in init File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/parser/OMAOutputParser.py", line 64, in _check_oma_output_path FileNotFoundError: [Errno 2] No such file or directory: '/media/Data/Alfred/Read2tree/myWorkingDir/Output/OrthologousGroup' (base) alfred@adn10:/media/Data/Alfred/Read2tree$

The resulting outputs file doesn't contain any aligns or three files.

Might it be a problem with the installation of the software or is related with the OGs obtained from the species I selected?

Thank you

sinamajidian commented 1 year ago

Dear @Alfred-RRu Thanks for using read2tree. It is important that the gene ids follow specific pattern. Could you possibly tell us the acession number of (or links to) the dna and cds of species that you use with OMA? I'll try to reproduce the error. Best regards, Sina

Alfred-RRu commented 1 year ago

Dear @sinamajidian

Thank for answering.

I use the following links:

The process to obtain marker genes took 10 days with these. I don't know if that is common.

Some of the sequences analyzed can be found on NCBI GenBank. I write you here the SRA accession numbers in case any it helps.

Best regards,

Alfred

sinamajidian commented 1 year ago

I used the same faa and fna files that you mentioned but I couldn't reproduce the error yet. could you possibly share the file OG1.fa in the marker_genes folder with us here?

1) I put the unzipped downloaded faa and fna files in data folder.

$ ls data/
GCF_001417965.1_Aiptasia_genome_1.1_translated_cds.faa GCF_022113875.1_Hydra_105_v3_cds_from_genomic.fna  GCF_009602425.1_ASM960242v1_cds_from_genomic.fna   GCF_022113875.1_Hydra_105_v3_translated_cds.faa  GCF_001417965.1_Aiptasia_genome_1.1_cds_from_genomic.fna  GCF_009602425.1_ASM960242v1_translated_cds.faa

2) Then I ran clean_fasta

mkdir DB
python clean_fasta_cdna_cds.py data DB data/all_cdna_out.fa 

This resulted in

$ ls data/ | grep -v "GCF"
all_cdna_out.fa   five_letter_species.tsv   
 $ ls DB/
02425.fa  13875.fa  17965.fa
$ head -n2 DB/02425.fa 
>02425|NW.022257946.1.XP.031553244.1.1
GDRLSKGARQKWTTLTKLDFITLTKSAQSHCFTTQNTQSPQSSSLLRRRKHALPRPLHGE

$ head -n 2 data/all_cdna_out.fa 
>02425|NW.022257946.1.XP.031553244.1.1
GGTGACAGACTATCGAAGGGAGCGAGACAAAAATGGACTACATTGACAAAATTGGACTTC

$ cat data/five_letter_species.tsv
GCF_022113875.1_Hydra_105_v3_translated_cds 13875
GCF_009602425.1_ASM960242v1_translated_cds  02425
GCF_001417965.1_Aiptasia_genome_1.1_translated_cds  17965

3) Then I ran OMA

OMA=/OMA.2.5.0/bin/oma
${OMA} -p
# OutgroupSpecies := ['13875'];

# we usually use HPC array of 500. Still it took some hours.
${OMA} -s

${OMA} 

This resulted in

 $ grep OutgroupSpecies parameters.drw  
OutgroupSpecies := ['13875'];
# Outgro 
# ... 

$ ls myWorkingDir/Output/OrthologousGroupsFasta/ | head -n 2
OG10000.fa
OG10001.fa
OG10002.fa

$ head OG10000.fa
>02425|NW.022261782.1.XP.031572670.1.25016 [02425]
MVQQIEVDEILFEGDIVLDEKNRDVDEPQEQKRNARRDRSKIWKTKVVPYVIDQSLRSPTDAGPHIQQAIAEFHKHTCVK

The OG ids could be different in different run. But the format of the ID of fasta record is important.

4) To make the process faster I only copied top 10 biggest OGs.

ls -althS myWorkingDir/Output/OrthologousGroupsFasta  | head -n 10
mkdir marker_genes
cp   myWorkingDir/Output/OrthologousGroupsFasta/ XX marker_genes  # replace XX with those 10 files.

5) Then I ran the first step of read2tree (finished in 1 minutes )

read2tree  --standalone_path  marker_genes  --output_path output --reference --dna_reference  data/all_cdna_out.fa  

the output log

$ read2tree  --standalone_path  marker_genes  --output_path output --reference --dna_reference  ../data/all_cdna_out.fa  
--- Load OGs with min 0 species from oma marker_genes - mode = marker_genes ---
Loading files for pre-filter: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 5799.25 OGs/s]
2023-08-18 10:10:06,257 - read2tree.OGSet - INFO - --- Load ogs and find their corresponding DNA seq from ../data/all_cdna_out.fa ---
2023-08-18 10:10:06,257 - read2tree.OGSet - INFO - Loading all_cdna_out.fa into memory. This might take a while . . . 
Loading OGs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 1135.44 OGs/s]
2023-08-18 10:10:08,205 - read2tree.OGSet - INFO - : Gathering of DNA seq for 8 OGs took 0.007969141006469727.
--- Generating reference for mapping ---
Loading records: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 113743.84 record/s]
2023-08-18 10:10:08,235 - read2tree.ReferenceSet - INFO - : Extracted 3 reference species form 8 ogs took 0.0005331039428710938
--- Alignment of 8 OGs ---
/work/FAC/FBM/DBC/cdessim2/default/smajidi1/software/miniconda/envs/r2t_3.10.8b/lib/python3.10/subprocess.py:961: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  self.stdout = io.open(c2pread, 'rb', bufsize)
/work/FAC/FBM/DBC/cdessim2/default/smajidi1/software/miniconda/envs/r2t_3.10.8b/lib/python3.10/subprocess.py:966: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
  self.stderr = io.open(errread, 'rb', bufsize)
2023-08-18 10:10:22,284 - read2tree.Aligner - INFO - : Alignment of 8 OGs took 14.047221422195435.

this resulted in

$ ls
marker_genes  mplog.log  output
 ls output/
01_ref_ogs_aa  01_ref_ogs_dna  02_ref_dna  03_align_aa  03_align_dna
 $ ls marker_genes/
OG105.fa  OG14.fa  OG15.fa  OG17.fa  OG19.fa  OG1.fa  OG29.fa  OG6.fa

(I also ran for all OGs, and it looks fine: read2tree.Aligner - INFO - : Alignment of 18094 OGs took 2122 seconds)

could you possibly share the file OG1.fa in the marker_genes folder with us here?

You could also run step 4 and 5 again in a new folder.

Alfred-RRu commented 1 year ago

Hello Sina,

Sorry for the delay in the answer. I've been checking the first analysis I made, and I have find a mistake in the faa and fna data. Since this I'am running the analysis as you do, to obtain the same results, but my computing capacity seems to be less powerful than yours as it seems to take 300 hours to run OMA, even when -s is used. Or is this maybe indicating a trouble with my analysis?

Once I have the results I'll check them in comparison with yours, and proceed, if everything is okay, with the next steps as you said.

Nevertheless, I understand it will take more time since I'll include the SRA data in the analysis.

Once everything is finish I will be back to you.

Thank you so much.

Alfred

P.S: I have include my marker_genes folder here in case it helps. marker_genes.zip