DessimozLab / read2tree

a tool for inferring species tree from sequencing reads
MIT License
138 stars 18 forks source link

for test dataset, tree topology matches but not branch lengths; is this ok? #57

Closed douglasgscofield closed 6 months ago

douglasgscofield commented 6 months ago

I've installed read2tree as a module on our cluster using python 3.10.8 into a virtualenv for the python parts and existing modules for iqtree, ngm, ngmlr, MAFFT, and samtools. This seems to work, and running the test dataset produces the expected tree topology (with rotations), but the branch lengths are proportional but do not match:

(sample_1:0.0512981536,((MNELE:0.9380208908,XENLA:0.1480736158):0.1345595662,(HUMAN:0.0040825358,GORGO:0.0101486675):0.0491682192):0.0636010693,RATNO:0.0166937325);

Is this ok?

I also get a RuntimeWarning: Mean of empty slice from numpy but that may be there being 0 mapping reads at that point?

Below is the output of

read2tree --debug --tree --standalone_path marker_genes/ --reads sample_1.fastq sample_2.fastq --output_path output.1853 --dna_reference dna_ref.fa

and I've attached output.1853.zip containing the output directory and mplog.log.

rackham5: /sw/bioinfo/read2tree/0.1.5-20240117-ff2d167/src/read2tree/tests $ read2tree --debug --tree --standalone_path marker_genes/ --reads sample_1.fastq sample_2.fastq --output_path output.1853 --dna_reference dna_ref.fa
--- Load OGs with min 0 species from oma marker_gen
es - mode = marker_genes ---
Loading files for pre-filter: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 326.07 OGs/s]
2024-01-25 18:54:01,145 - read2tree.OGSet - INFO - --- Load ogs and find their corresponding DNA seq from dna_ref.fa ---
2024-01-25 18:54:01,145 - read2tree.OGSet - INFO - Loading dna_ref.fa into memory. This might take a while . . .
Loading OGs: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 272.03 OGs/s]
2024-01-25 18:54:01,224 - read2tree.OGSet - INFO - sample_1: Gathering of DNA seq for 20 OGs took 0.07529854774475098.
--- Generating reference for mapping ---
Loading records: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 102801.57 record/s]
2024-01-25 18:54:01,225 - read2tree.ReferenceSet - INFO - sample_1: Extracted 5 reference species form 20 ogs took 0.0009586811065673828
--- Alignment of 20 OGs ---
2024-01-25 18:54:08,571 - read2tree.Aligner - INFO - sample_1: Alignment of 20 OGs took 7.336272954940796.
--- Mapping of reads to reference sequences ---
Mapping reads to species:   0%|                                                                                                                | 0/5 [00:00<?, ? species/s]2024-01-25 18:54:08,584 - read2tree.Mapper - INFO - sample_1: --- Mapping of reads to MNELE reference species ---
2024-01-25 18:54:13,834 - read2tree.Mapper - INFO - sample_1: Mapped 27 / 50000 reads to MNELE_OGs.fa
2024-01-25 18:54:13,913 - read2tree.Mapper - INFO - sample_1: Mapping to MNELE_OGs.fa references took 5.327188968658447.
/sw/bioinfo/read2tree/0.1.5-20240117-ff2d167/rackham/venv/lib/python3.10/site-packages/numpy/core/fromnumeric.py:3504: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/sw/bioinfo/read2tree/0.1.5-20240117-ff2d167/rackham/venv/lib/python3.10/site-packages/numpy/core/_methods.py:129: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
/sw/bioinfo/read2tree/0.1.5-20240117-ff2d167/rackham/venv/lib/python3.10/site-packages/numpy/core/_methods.py:206: RuntimeWarning: Degrees of freedom <= 0 for slice
  ret = _var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/sw/bioinfo/read2tree/0.1.5-20240117-ff2d167/rackham/venv/lib/python3.10/site-packages/numpy/core/_methods.py:163: RuntimeWarning: invalid value encountered in divide
  arrmean = um.true_divide(arrmean, div, out=arrmean,
/sw/bioinfo/read2tree/0.1.5-20240117-ff2d167/rackham/venv/lib/python3.10/site-packages/numpy/core/_methods.py:198: RuntimeWarning: invalid value encountered in scalar divide
  ret = ret.dtype.type(ret / rcount)
Mapping reads to species:  20%|████████████████████▊                                                                                   | 1/5 [00:05<00:21,  5.45s/ species]2024-01-25 18:54:14,030 - read2tree.Mapper - INFO - sample_1: --- Mapping of reads to HUMAN reference species ---
2024-01-25 18:54:19,739 - read2tree.Mapper - INFO - sample_1: Mapped 93 / 50000 reads to HUMAN_OGs.fa
2024-01-25 18:54:19,816 - read2tree.Mapper - INFO - sample_1: Mapping to HUMAN_OGs.fa references took 5.784032821655273.
Mapping reads to species:  40%|█████████████████████████████████████████▌                                                              | 2/5 [00:11<00:17,  5.72s/ species]2024-01-25 18:54:19,942 - read2tree.Mapper - INFO - sample_1: --- Mapping of reads to RATNO reference species ---
2024-01-25 18:54:25,117 - read2tree.Mapper - INFO - sample_1: Mapped 99 / 50000 reads to RATNO_OGs.fa
2024-01-25 18:54:25,196 - read2tree.Mapper - INFO - sample_1: Mapping to RATNO_OGs.fa references took 5.2527854442596436.
Mapping reads to species:  60%|██████████████████████████████████████████████████████████████▍                                         | 3/5 [00:16<00:11,  5.56s/ species]2024-01-25 18:54:25,301 - read2tree.Mapper - INFO - sample_1: --- Mapping of reads to GORGO reference species ---
2024-01-25 18:54:30,864 - read2tree.Mapper - INFO - sample_1: Mapped 90 / 50000 reads to GORGO_OGs.fa
2024-01-25 18:54:30,938 - read2tree.Mapper - INFO - sample_1: Mapping to GORGO_OGs.fa references took 5.634907484054565.
Mapping reads to species:  80%|███████████████████████████████████████████████████████████████████████████████████▏                    | 4/5 [00:22<00:05,  5.62s/ species]2024-01-25 18:54:31,030 - read2tree.Mapper - INFO - sample_1: --- Mapping of reads to XENLA reference species ---
2024-01-25 18:54:36,491 - read2tree.Mapper - INFO - sample_1: Mapped 43 / 50000 reads to XENLA_OGs.fa
2024-01-25 18:54:36,570 - read2tree.Mapper - INFO - sample_1: Mapping to XENLA_OGs.fa references took 5.538030385971069.
Mapping reads to species: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:28<00:00,  5.62s/ species]
2024-01-25 18:54:36,663 - read2tree.Mapper - INFO - sample_1: Mapping to all references took 28.085158824920654.
Adding mapped seq to alignments: 100%|████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 321402.61 alignments/s]
Adding mapped seq to OG: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 341000.33 OGs/s]
--- Add inferred mapped sequence back to OGs ---
Adding mapped seq to OG: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 54971.22 OGs/s]
2024-01-25 18:54:36,721 - read2tree.OGSet - INFO - sample_1: Appending 10 reconstructed sequences to present OG took 0.0034279823303222656.
--- Add inferred mapped sequence back to alignment ---
Adding mapped seq to alignments: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 20/20 [00:00<00:00, 1131.07 alignments/s]
2024-01-25 18:54:36,804 - read2tree.Aligner - INFO - sample_1: Appending 3 reconstructed sequences to present Alignments took 0.018320322036743164.
--- Tree inference ---
2024-01-25 18:54:46,019 - read2tree.TreeInference - INFO - sample_1: Tree inference took 9.093188285827637.
(sample_1:0.0512981536,((MNELE:0.9380208908,XENLA:0.1480736158):0.1345595662,(HUMAN:0.0040825358,GORGO:0.0101486675):0.0491682192):0.0636010693,RATNO:0.0166937325);
sinamajidian commented 6 months ago

Dear @douglasgscofield

Thanks for using read2tree and contacting us for clarifications. Yes, it's working well. There is some heuristic in tree inference so the branch length might not be always the same. Note that the output tree is not rooted. You can ignore that warning. That's fine.

Best regards, Sina

douglasgscofield commented 6 months ago

Thank you for your feedback! Then I will close this issue.

Best, Douglas