Open Alfred-RRu opened 1 year ago
Dear @Alfred-RRu Thanks for using read2tree. It is important that the gene ids follow specific pattern. Could you possibly tell us the acession number of (or links to) the dna and cds of species that you use with OMA? I'll try to reproduce the error. Best regards, Sina
Dear @sinamajidian
Thank for answering.
I use the following links:
The process to obtain marker genes took 10 days with these. I don't know if that is common.
Some of the sequences analyzed can be found on NCBI GenBank. I write you here the SRA accession numbers in case any it helps.
Best regards,
Alfred
I used the same faa and fna files that you mentioned but I couldn't reproduce the error yet. could you possibly share the file OG1.fa in the marker_genes
folder with us here?
1) I put the unzipped downloaded faa and fna files in data folder.
$ ls data/
GCF_001417965.1_Aiptasia_genome_1.1_translated_cds.faa GCF_022113875.1_Hydra_105_v3_cds_from_genomic.fna GCF_009602425.1_ASM960242v1_cds_from_genomic.fna GCF_022113875.1_Hydra_105_v3_translated_cds.faa GCF_001417965.1_Aiptasia_genome_1.1_cds_from_genomic.fna GCF_009602425.1_ASM960242v1_translated_cds.faa
2) Then I ran clean_fasta
mkdir DB
python clean_fasta_cdna_cds.py data DB data/all_cdna_out.fa
This resulted in
$ ls data/ | grep -v "GCF"
all_cdna_out.fa five_letter_species.tsv
$ ls DB/
02425.fa 13875.fa 17965.fa
$ head -n2 DB/02425.fa
>02425|NW.022257946.1.XP.031553244.1.1
GDRLSKGARQKWTTLTKLDFITLTKSAQSHCFTTQNTQSPQSSSLLRRRKHALPRPLHGE
$ head -n 2 data/all_cdna_out.fa
>02425|NW.022257946.1.XP.031553244.1.1
GGTGACAGACTATCGAAGGGAGCGAGACAAAAATGGACTACATTGACAAAATTGGACTTC
$ cat data/five_letter_species.tsv
GCF_022113875.1_Hydra_105_v3_translated_cds 13875
GCF_009602425.1_ASM960242v1_translated_cds 02425
GCF_001417965.1_Aiptasia_genome_1.1_translated_cds 17965
3) Then I ran OMA
OMA=/OMA.2.5.0/bin/oma
${OMA} -p
# OutgroupSpecies := ['13875'];
# we usually use HPC array of 500. Still it took some hours.
${OMA} -s
${OMA}
This resulted in
$ grep OutgroupSpecies parameters.drw
OutgroupSpecies := ['13875'];
# Outgro
# ...
$ ls myWorkingDir/Output/OrthologousGroupsFasta/ | head -n 2
OG10000.fa
OG10001.fa
OG10002.fa
$ head OG10000.fa
>02425|NW.022261782.1.XP.031572670.1.25016 [02425]
MVQQIEVDEILFEGDIVLDEKNRDVDEPQEQKRNARRDRSKIWKTKVVPYVIDQSLRSPTDAGPHIQQAIAEFHKHTCVK
The OG ids could be different in different run. But the format of the ID of fasta record is important.
4) To make the process faster I only copied top 10 biggest OGs.
ls -althS myWorkingDir/Output/OrthologousGroupsFasta | head -n 10
mkdir marker_genes
cp myWorkingDir/Output/OrthologousGroupsFasta/ XX marker_genes # replace XX with those 10 files.
5) Then I ran the first step of read2tree (finished in 1 minutes )
read2tree --standalone_path marker_genes --output_path output --reference --dna_reference data/all_cdna_out.fa
the output log
$ read2tree --standalone_path marker_genes --output_path output --reference --dna_reference ../data/all_cdna_out.fa
--- Load OGs with min 0 species from oma marker_genes - mode = marker_genes ---
Loading files for pre-filter: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 5799.25 OGs/s]
2023-08-18 10:10:06,257 - read2tree.OGSet - INFO - --- Load ogs and find their corresponding DNA seq from ../data/all_cdna_out.fa ---
2023-08-18 10:10:06,257 - read2tree.OGSet - INFO - Loading all_cdna_out.fa into memory. This might take a while . . .
Loading OGs: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 1135.44 OGs/s]
2023-08-18 10:10:08,205 - read2tree.OGSet - INFO - : Gathering of DNA seq for 8 OGs took 0.007969141006469727.
--- Generating reference for mapping ---
Loading records: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 113743.84 record/s]
2023-08-18 10:10:08,235 - read2tree.ReferenceSet - INFO - : Extracted 3 reference species form 8 ogs took 0.0005331039428710938
--- Alignment of 8 OGs ---
/work/FAC/FBM/DBC/cdessim2/default/smajidi1/software/miniconda/envs/r2t_3.10.8b/lib/python3.10/subprocess.py:961: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
self.stdout = io.open(c2pread, 'rb', bufsize)
/work/FAC/FBM/DBC/cdessim2/default/smajidi1/software/miniconda/envs/r2t_3.10.8b/lib/python3.10/subprocess.py:966: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used
self.stderr = io.open(errread, 'rb', bufsize)
2023-08-18 10:10:22,284 - read2tree.Aligner - INFO - : Alignment of 8 OGs took 14.047221422195435.
this resulted in
$ ls
marker_genes mplog.log output
ls output/
01_ref_ogs_aa 01_ref_ogs_dna 02_ref_dna 03_align_aa 03_align_dna
$ ls marker_genes/
OG105.fa OG14.fa OG15.fa OG17.fa OG19.fa OG1.fa OG29.fa OG6.fa
(I also ran for all OGs, and it looks fine: read2tree.Aligner - INFO - : Alignment of 18094 OGs took 2122 seconds)
could you possibly share the file OG1.fa in the marker_genes
folder with us here?
You could also run step 4 and 5 again in a new folder.
Hello Sina,
Sorry for the delay in the answer. I've been checking the first analysis I made, and I have find a mistake in the faa and fna data. Since this I'am running the analysis as you do, to obtain the same results, but my computing capacity seems to be less powerful than yours as it seems to take 300 hours to run OMA, even when -s is used. Or is this maybe indicating a trouble with my analysis?
Once I have the results I'll check them in comparison with yours, and proceed, if everything is okay, with the next steps as you said.
Nevertheless, I understand it will take more time since I'll include the SRA data in the analysis.
Once everything is finish I will be back to you.
Thank you so much.
Alfred
P.S: I have include my marker_genes folder here in case it helps. marker_genes.zip
Hello,
Thank you first for your interesting work.
I'm trying to apply your method to some analyses over corals. OMA only contain few distant species of cnidarians, so I followed the steps in "obtaining marker genes for viral dataset", obtaining the CDS and cDNA of three species related to the ones I'm studiying.
After obtaining 16135 OGS, OMA was run. Nevethelss, when this OGS are compared with my sequences the message "This OG did not have any DNA" for each OG.
After the analysis, the next lines are showed: Traceback (most recent call last): File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, *kwds)) File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar return list(map(args)) File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/Aligner.py", line 292, in _align_worker align.dna = self._get_translated_alignment(codons, alignment) File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/Aligner.py", line 216, in _get_translated_alignment codon = codons[rec.id] KeyError: '04095' """
The above exception was the direct cause of the following exception:
Traceback (most recent call last): File "/home/alfred/miniconda3/envs/r2t/bin/read2tree", line 4, in
import('pkg_resources').run_script('read2tree==0.1.5', 'read2tree')
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/pkg_resources/init.py", line 720, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/pkg_resources/init.py", line 1570, in run_script
exec(script_code, namespace, namespace)
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/EGG-INFO/scripts/read2tree", line 16, in
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/main.py", line 291, in main
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/Aligner.py", line 51, in init
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/Aligner.py", line 330, in _align
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 367, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 774, in get
raise self._value
KeyError: '04095'
(base) alfred@adn10:/media/Data/Alfred/Read2tree$ read2tree --standalone_path /media/Data/Alfred/Read2tree/myWorkingDir/Output/OrthologousGroup
Traceback (most recent call last):
File "/home/alfred/miniconda3/envs/r2t/bin/read2tree", line 4, in
import('pkg_resources').run_script('read2tree==0.1.5', 'read2tree')
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/pkg_resources/init.py", line 720, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/pkg_resources/init.py", line 1570, in run_script
exec(script_code, namespace, namespace)
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/EGG-INFO/scripts/read2tree", line 16, in
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/main.py", line 287, in main
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/parser/OMAOutputParser.py", line 22, in init
File "/home/alfred/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree-0.1.5-py3.10.egg/read2tree/parser/OMAOutputParser.py", line 64, in _check_oma_output_path
FileNotFoundError: [Errno 2] No such file or directory: '/media/Data/Alfred/Read2tree/myWorkingDir/Output/OrthologousGroup'
(base) alfred@adn10:/media/Data/Alfred/Read2tree$
The resulting outputs file doesn't contain any aligns or three files.
Might it be a problem with the installation of the software or is related with the OGs obtained from the species I selected?
Thank you