DessimozLab / read2tree

a tool for inferring species tree from sequencing reads
MIT License
142 stars 18 forks source link

Applied read2tree to customer marker gene sets #54

Closed YuWang07 closed 9 months ago

YuWang07 commented 9 months ago

Dear team: Thanks for developing this tool which is really helpful. I am trying to use it for a customer marker gene sets and I reformat the protein sequence and cds sequence as suggested here: [#22]

Here is an example of the ids of the markers: marker protein: >snapmasked-CP113692.1-processed-gene-8.84-mRNA-1 | [GCA027563075.1] cds ref: >snapmasked-CP113692.1-processed-gene-8.84-mRNA-1

I also remove all the underscore "_" as previously someone asked, but i still keep getting this error:

Creates the reference folder
--- Load OGs with min 0 species from oma /scratch/project/wwvirus/RO3/02_orthogroup/04.marker_reformat/02_sub14/02.AA - mode = marker_genes ---
^MLoading files for pre-filter:   0%|          | 0/32 [00:00<?, ? OGs/s]^MLoading files for pre-filter: 100%|██████████| 32/32 [00:00<00:00, 984.64 OGs/s]
2023-12-26 21:55:26,900 - read2tree.OGSet - INFO - --- Load ogs and find their corresponding DNA seq from /scratch/project/wwvirus/RO3/02_orthogroup/04.marker_reformat/02_sub14/01.CDS/dna_ref.fa ---
2023-12-26 21:55:26,900 - read2tree.OGSet - INFO - Loading dna_ref.fa into memory. This might take a while . . .
^MLoading OGs:   0%|          | 0/32 [00:00<?, ? OGs/s]^MLoading OGs:   0%|          | 0/32 [00:01<?, ? OGs/s]
Traceback (most recent call last):
  File "/home/s4519890/miniconda3/envs/r2t/bin/read2tree", line 16, in <module>
    main(sys.argv[1:], exe_name=exe_name(), desc=desc)
  File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/main.py", line 289, in main
    ogset = OGSet(args, oma_output=oma_output, progress=progress)  # Generate the OGs with their DNA sequences
  File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 79, in __init__
    self.ogs = self._load_ogs()
  File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 186, in _load_ogs
    ogs[name].dna = self._get_dna_records(ogs[name].aa,
  File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 361, in _get_dna_records
    og_cdna.append(self._get_dna_from_fasta(record, db))
  File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 322, in _get_dna_from_fasta
    return self._get_dna_from_REST(record)
  File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 282, in _get_dna_from_REST
    seq = oma_record.json()['cdna']
KeyError: 'cdna'

Could you please help on this issue? Any idea about why this happen?

Thanks you very much.

sinamajidian commented 9 months ago

Hello

Thanks for using read2tree. It seems that the fromatting is not correct. I guess the record id of both should be indentical, but I need to double check. Could you please send us a few fasta files of gene markers and their cds?

Best, Sina

On Tue, 26 Dec 2023, 15:29 YuWang, @.***> wrote:

Dear team: Thanks for developing this tool which is really helpful. I am trying to use it for a customer marker gene sets and I reformat the protein sequence and cds sequence as suggested here: [#22 https://github.com/DessimozLab/read2tree/issues/22]

Here is an example of the ids of the markers: marker protein: >snapmasked-CP113692.1-processed-gene-8.84-mRNA-1 | [GCA027563075.1] cds ref: >snapmasked-CP113692.1-processed-gene-8.84-mRNA-1

I also remove all the underscore "_" as previously someone asked, but i still keep getting this error:

Creates the reference folder --- Load OGs with min 0 species from oma /scratch/project/wwvirus/RO3/02_orthogroup/04.marker_reformat/02_sub14/02.AA - mode = marker_genes --- ^MLoading files for pre-filter: 0%| | 0/32 [00:00<?, ? OGs/s]^MLoading files for pre-filter: 100%|██████████| 32/32 [00:00<00:00, 984.64 OGs/s] 2023-12-26 21:55:26,900 - read2tree.OGSet - INFO - --- Load ogs and find their corresponding DNA seq from /scratch/project/wwvirus/RO3/02_orthogroup/04.marker_reformat/02_sub14/01.CDS/dna_ref.fa --- 2023-12-26 21:55:26,900 - read2tree.OGSet - INFO - Loading dna_ref.fa into memory. This might take a while . . . ^MLoading OGs: 0%| | 0/32 [00:00<?, ? OGs/s]^MLoading OGs: 0%| | 0/32 [00:01<?, ? OGs/s] Traceback (most recent call last): File "/home/s4519890/miniconda3/envs/r2t/bin/read2tree", line 16, in main(sys.argv[1:], exe_name=exe_name(), desc=desc) File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/main.py", line 289, in main ogset = OGSet(args, oma_output=oma_output, progress=progress) # Generate the OGs with their DNA sequences File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 79, in init self.ogs = self._load_ogs() File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 186, in _load_ogs ogs[name].dna = self._get_dna_records(ogs[name].aa, File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 361, in _get_dna_records og_cdna.append(self._get_dna_from_fasta(record, db)) File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 322, in _get_dna_from_fasta return self._get_dna_from_REST(record) File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/OGSet.py", line 282, in _get_dna_from_REST seq = oma_record.json()['cdna'] KeyError: 'cdna'

Could you please help on this issue? Any idea about why this happen?

Thanks you very much.

— Reply to this email directly, view it on GitHub https://github.com/DessimozLab/read2tree/issues/54, or unsubscribe https://github.com/notifications/unsubscribe-auth/AJ7LXTBHPWUB4LLQUVLKVILYLK3ZHAVCNFSM6AAAAABBDGWG7SVHI2DSMVQWIX3LMV43ASLTON2WKOZSGA2TMMZUGQ2TEMQ . You are receiving this because you are subscribed to this thread.Message ID: @.***>

YuWang07 commented 9 months ago

Hi Sina

Thanks for the quick reply. The id should be identical in my maker and cds file. I have attached my maker gene set. Here is the protein fasta file with >id | [species_name]: 02.AA.zip the reference cds fasta file with >id: dna_ref.fa.zip the reference cds fasta file with >id | [species_name]: ref.format.cds.fa.zip

I try both of them but get the same error when creating the reference folder:

read2tree --standalone_path --standalone_path 02.AA/ --output_path ${OUT_DIR} --reference --dna_reference dna_ref.fa

Thanks. Yu

sinamajidian commented 9 months ago

Dear Yu The ID of fasta records should be identical for both aa and dna. I can see some slight differences

 $ grep CP113692 02.AA/*
02.AA/OG0012664.fa:>snapmasked-CP113692.1-processed-gene-8.84-mRNA-1 | [GCA027563075.1]
02.AA/OG0012753.fa:>snapmasked-CP113692.1-processed-gene-8.66-mRNA-1 | [GCA027563075.1]
 $ grep CP113692 dna_ref.fa
>snapmaskedCP113692.1processedgene8.84mRNA1 | [GCA027563075.1]
>snapmaskedCP113692.1processedgene8.66mRNA1 | [GCA027563075.1]
 $ grep CP113692  ref.format.cds.fa
>snap_masked-CP113692.1-processed-gene-8.84-mRNA-1
>snap_masked-CP113692.1-processed-gene-8.66-mRNA-1
 $ grep "snapmasked-CP113692.1-processed-gene-8.84-mRNA-1"  ref.format.cds.fa
 $ 

Please send me a new version, so I can check whether adding [species] is enough or not.

alpae commented 9 months ago

Dear Yu,

it seems that now read2tree does ignore the --dna_reference argument (in log: Load logs … from REST-API). I think it checks that the file name contains a .fa or .fasta extension.

so by renaming the dna.cds.da -> dna.cds.fa it should work I assume.

Best Adrian

On 9 Jan 2024, at 05:54, YuWang @.***> wrote:

Hi, Sina

Thank you for the help.

I simplify the gene name for the marker gene set. But when I try to build the reference. Here is some example of the input files:

$ grep ">" dna.cds.da | head

OSNDL11138-RA | [GCA_008403515.1] RNTFV01679-RA | [GCA_014843625.1] NBQJT01402-RA | [GCA_027563095.1] CDDWM01290-RA | [GCA_027563055.1] TNUXI01279-RA | [GCA_027563075.1] TRFAQ09314-RA | [GCA_902703645.1] WBVKE10572-RA | [GCA_000499105.1] LJCVG04613-RA | [GCA_020891285.1] OSNDL10718-RA | [GCA_008403515.1] RNTFV00574-RA | [GCA_014843625.1]

$ grep ">" OG0012668.fa

OSNDL11138-RA | [GCA_008403515.1] RNTFV01679-RA | [GCA_014843625.1] NBQJT01402-RA | [GCA_027563095.1] CDDWM01290-RA | [GCA_027563055.1] TNUXI01279-RA | [GCA_027563075.1] TRFAQ09314-RA | [GCA_902703645.1] WBVKE10572-RA | [GCA_000499105.1] LJCVG04613-RA | [GCA_020891285.1] I get the following error:

read2tree --standalone_path ./00.AA/ --output_path /scratch/project/wwvirus/RO3_24_reanalysis/02_R2T_OUT --reference --dna_reference ./dna.cds.da

--- Load OGs with min 0 species from oma 00.AA - mode = marker_genes --- Loading files for pre-filter: 100%|███████████████████████████████| 32/32 [00:00<00:00, 348.23 OGs/s] 2024-01-09 14:36:46,716 - read2tree.OGSet - INFO - --- Load ogs and find their corresponding DNA seq using the REST api --- Loading OGs: 0%| | 0/32 [00:00<?, ? OGs/s]2024-01-09 14:36:49,640 - read2tree.OGSet - WARNING - This OG OG0012773 did not have any DNA Loading OGs: 3%|█▌ | 1/32 [00:02<01:30, 2.92s/ OGs]2024-01-09 14:36:51,941 - read2tree.OGSet - WARNING - This OG OG0012825 did not have any DNA Loading OGs: 6%|███▏ | 2/32 [00:05<01:16, 2.56s/ OGs]2024-01-09 14:36:54,196 - read2tree.OGSet - WARNING - This OG OG0012882 did not have any DNA Loading OGs: 9%|████▋ | 3/32 [00:07<01:10, 2.42s/ OGs]2024-01-09 14:36:56,467 - read2tree.OGSet - WARNING - This OG OG0012811 did not have any DNA Loading OGs: 12%|██████▎ | 4/32 [00:09<01:06, 2.36s/ OGs]2024-01-09 14:36:58,766 - read2tree.OGSet - WARNING - This OG OG0012807 did not have any DNA Loading OGs: 16%|███████▊ | 5/32 [00:12<01:03, 2.34s/ OGs]2024-01-09 14:37:01,057 - read2tree.OGSet - WARNING - This OG OG0012748 did not have any DNA Loading OGs: 19%|█████████▍ | 6/32 [00:14<01:00, 2.32s/ OGs]2024-01-09 14:37:03,316 - read2tree.OGSet - WARNING - This OG OG0012756 did not have any DNA Loading OGs: 22%|██████████▉ | 7/32 [00:16<00:57, 2.30s/ OGs]2024-01-09 14:37:05,579 - read2tree.OGSet - WARNING - This OG OG0012760 did not have any DNA Loading OGs: 25%|████████████▌ | 8/32 [00:18<00:54, 2.29s/ OGs]2024-01-09 14:37:07,841 - read2tree.OGSet - WARNING - This OG OG0012812 did not have any DNA Loading OGs: 28%|██████████████ | 9/32 [00:21<00:52, 2.28s/ OGs]2024-01-09 14:37:10,144 - read2tree.OGSet - WARNING - This OG OG0012816 did not have any DNA Loading OGs: 31%|███████████████▎ | 10/32 [00:23<00:50, 2.29s/ OGs]2024-01-09 14:37:12,410 - read2tree.OGSet - WARNING - This OG OG0012668 did not have any DNA Loading OGs: 34%|████████████████▊ | 11/32 [00:25<00:47, 2.28s/ OGs]2024-01-09 14:37:14,680 - read2tree.OGSet - WARNING - This OG OG0012707 did not have any DNA Loading OGs: 38%|██████████████████▍ | 12/32 [00:27<00:45, 2.28s/ OGs]2024-01-09 14:37:16,960 - read2tree.OGSet - WARNING - This OG OG0012732 did not have any DNA Loading OGs: 41%|███████████████████▉ | 13/32 [00:30<00:43, 2.28s/ OGs]2024-01-09 14:37:19,245 - read2tree.OGSet - WARNING - This OG OG0012867 did not have any DNA Loading OGs: 44%|█████████████████████▍ | 14/32 [00:32<00:41, 2.28s/ OGs]2024-01-09 14:37:21,502 - read2tree.OGSet - WARNING - This OG OG0012716 did not have any DNA Loading OGs: 47%|██████████████████████▉ | 15/32 [00:34<00:38, 2.27s/ OGs]2024-01-09 14:37:23,800 - read2tree.OGSet - WARNING - This OG OG0012689 did not have any DNA Loading OGs: 50%|████████████████████████▌ | 16/32 [00:37<00:36, 2.28s/ OGs]2024-01-09 14:37:26,068 - read2tree.OGSet - WARNING - This OG OG0012755 did not have any DNA Loading OGs: 53%|██████████████████████████ | 17/32 [00:39<00:34, 2.28s/ OGs]2024-01-09 14:37:28,334 - read2tree.OGSet - WARNING - This OG OG0012841 did not have any DNA Loading OGs: 56%|███████████████████████████▌ | 18/32 [00:41<00:31, 2.27s/ OGs]2024-01-09 14:37:30,622 - read2tree.OGSet - WARNING - This OG OG0012737 did not have any DNA Loading OGs: 59%|█████████████████████████████ | 19/32 [00:43<00:29, 2.28s/ OGs]2024-01-09 14:37:32,899 - read2tree.OGSet - WARNING - This OG OG0012680 did not have any DNA Loading OGs: 62%|██████████████████████████████▋ | 20/32 [00:46<00:27, 2.28s/ OGs]2024-01-09 14:37:35,149 - read2tree.OGSet - WARNING - This OG OG0012686 did not have any DNA Loading OGs: 66%|████████████████████████████████▏ | 21/32 [00:48<00:24, 2.27s/ OGs]2024-01-09 14:37:37,419 - read2tree.OGSet - WARNING - This OG OG0012878 did not have any DNA Loading OGs: 69%|█████████████████████████████████▋ | 22/32 [00:50<00:22, 2.27s/ OGs]2024-01-09 14:37:39,688 - read2tree.OGSet - WARNING - This OG OG0012690 did not have any DNA Loading OGs: 72%|███████████████████████████████████▏ | 23/32 [00:52<00:20, 2.27s/ OGs]2024-01-09 14:37:41,943 - read2tree.OGSet - WARNING - This OG OG0012888 did not have any DNA Loading OGs: 75%|████████████████████████████████████▊ | 24/32 [00:55<00:18, 2.26s/ OGs]2024-01-09 14:37:44,185 - read2tree.OGSet - WARNING - This OG OG0012692 did not have any DNA Loading OGs: 78%|██████████████████████████████████████▎ | 25/32 [00:57<00:15, 2.26s/ OGs]2024-01-09 14:37:46,422 - read2tree.OGSet - WARNING - This OG OG0012712 did not have any DNA Loading OGs: 81%|███████████████████████████████████████▊ | 26/32 [00:59<00:13, 2.25s/ OGs]2024-01-09 14:37:48,674 - read2tree.OGSet - WARNING - This OG OG0012838 did not have any DNA Loading OGs: 84%|█████████████████████████████████████████▎ | 27/32 [01:01<00:11, 2.25s/ OGs]2024-01-09 14:37:50,941 - read2tree.OGSet - WARNING - This OG OG0012719 did not have any DNA Loading OGs: 88%|██████████████████████████████████████████▉ | 28/32 [01:04<00:09, 2.26s/ OGs]2024-01-09 14:37:53,212 - read2tree.OGSet - WARNING - This OG OG0012854 did not have any DNA Loading OGs: 91%|████████████████████████████████████████████▍ | 29/32 [01:06<00:06, 2.26s/ OGs]2024-01-09 14:37:55,480 - read2tree.OGSet - WARNING - This OG OG0012765 did not have any DNA Loading OGs: 94%|█████████████████████████████████████████████▉ | 30/32 [01:08<00:04, 2.26s/ OGs]2024-01-09 14:37:57,739 - read2tree.OGSet - WARNING - This OG OG0012718 did not have any DNA Loading OGs: 97%|███████████████████████████████████████████████▍ | 31/32 [01:11<00:02, 2.26s/ OGs]2024-01-09 14:38:00,008 - read2tree.OGSet - WARNING - This OG OG0012688 did not have any DNA Loading OGs: 100%|█████████████████████████████████████████████████| 32/32 [01:13<00:00, 2.29s/ OGs] 2024-01-09 14:38:00,012 - read2tree.OGSet - INFO - : Gathering of DNA seq for 32 OGs took 73.2949275970459. --- Generating reference for mapping --- Loading records: 100%|██████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 114422.62 record/s] 2024-01-09 14:38:00,013 - read2tree.ReferenceSet - INFO - : Extracted 8 reference species form 32 ogs took 0.0006504058837890625 --- Alignment of 32 OGs --- /home/s4519890/miniconda3/envs/r2t/lib/python3.10/subprocess.py:961: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used self.stdout = io.open(c2pread, 'rb', bufsize) /home/s4519890/miniconda3/envs/r2t/lib/python3.10/subprocess.py:966: RuntimeWarning: line buffering (buffering=1) isn't supported in binary mode, the default buffer size will be used self.stderr = io.open(errread, 'rb', bufsize) multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 125, in worker result = (True, func(*args, *kwds)) File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar return list(map(args)) File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/Aligner.py", line 292, in _align_worker align.dna = self._get_translated_alignment(codons, alignment) File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/Aligner.py", line 216, in _get_translated_alignment codon = codons[rec.id] KeyError: 'OSNDL' """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "/home/s4519890/miniconda3/envs/r2t/bin/read2tree", line 16, in main(sys.argv[1:], exe_name=exe_name(), desc=desc) File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/main.py", line 291, in main alignments = Aligner(args, ogset.ogs, load=True) File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/Aligner.py", line 51, in init self.alignments = self._align(og_set) File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/site-packages/read2tree/Aligner.py", line 330, in _align res_align = p.map(self._align_worker, og_chunks) File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 367, in map return self._map_async(func, iterable, mapstar, chunksize).get() File "/home/s4519890/miniconda3/envs/r2t/lib/python3.10/multiprocessing/pool.py", line 774, in get raise self._value KeyError: 'OSNDL' Could you please help on this issue?

Best Regards, Yu.

— Reply to this email directly, view it on GitHub https://github.com/DessimozLab/read2tree/issues/54#issuecomment-1882412411, or unsubscribe https://github.com/notifications/unsubscribe-auth/AALAW5QZGPQEYQBRD4AGPODYNTER7AVCNFSM6AAAAABBDGWG7SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQOBSGQYTENBRGE. You are receiving this because you are subscribed to this thread.

YuWang07 commented 9 months ago

Hi Sian and Adrian,

Thank you for your assistance. I found the error is caused by the file extension and by correcting it to *fa, enabling the script to run smoothly. Regarding the aa and dna ID generation, simplify the ID generated by Maker will make things much easier, all related issues have been solved now.

Furthermore, successful testing with the customer marker set and raw sequencing confirms the flawless performance of 'read2tree'.

I deeply appreciate your development efforts and help. This matter can now be considered resolved as I'll be closing the issue.

Best Regards, Yu

alpae commented 9 months ago

glad to hear that the software is now working for you.