Dee-chen / Tree2gd

GNU General Public License v3.0
34 stars 7 forks source link

about the erro of step2 #13

Open lfp-a opened 1 year ago

lfp-a commented 1 year ago

When I run step 2, I get an error. Here's what debug says

19:47:53: DEBUG

19:47:53: INFO The program

/data/wanglab/liufangpu/miniconda3/envs/wgd/lib/python3.8/site-packages/tree2gd/software/PhyloMCL working properly [OK]

19:47:53: INFO The Step2 output dir :/data/wanglab/liufangpu/genome/gesneriaceae/tree2gd/new/output/step2.MCL does

not exist,will create it. 19:47:53: INFO Writing step2 script files... 19:47:53: INFO Start runing step2 MCL: /data/wanglab/liufangpu/genome/gesneriaceae/tree2gd/new/output/step2.MCL/step2.sh.. 22:11:24: DEBUG step2.sh stdout:

22:11:24: DEBUG step2.sh stderr: Running PhyloMCL...

    Parameters for threads:                         128
    Parameters for MCL inflation:                   1.10
    Parameters for OG neighbours:                   5
    Parameters for gene copy number:                6
    Parameters for similarity:                      0.20
    Parameters for fold-change of group numbers:    4.00

File Size 259343203990, threadSize 32417900498 === thread[2] === start_position[64835801082] === end_position[97253701644] === lines[283451201] === thread[1] === start_position[32417900500] === end_position[64835801082] === lines[331855295] === thread[5] === start_position[162089502752] === end_position[194507403275] === lines[354020981] === thread[3] === start_position[97253701644] === end_position[129671602184] === lines[353096450] === thread[7] === start_position[226925303804] === end_position[259343203990] === lines[353738690] === thread[0] === start_position[0] === end_position[32417900500] === lines[353334541] === thread[4] === start_position[129671602184] === end_position[162089502752] === lines[353067825] === thread[6] === start_position[194507403275] === end_position[226925303804] === lines[352888154]

Gene numbers: 5107714 BLAST lines: -1559514159 0: 353334541 1: 331855295 2: 283451201 3: 353096450 4: 353067825 5: 354020981 6: 352888154 7: 353738690 BLAST PARSED Species numbers 127 Length and species file loaded 1.000 not in species map

22:11:24: INFO step2 mcl has done.

The following is the config setting

[software]

[postfix]

pep=.pep cds=.cds [diamond]

-e=1e-10 [phymcl]

[mcl2fasta] min_taxa=10

[iqtree]

-B=1000 -m=JTT+G4 [tree2gd]

--bp=50

Thank you for your help

Dee-chen commented 1 year ago

According to the error report result, "1.000 not in species map" appears, so it should be because of the problem in your input file that the number "1.000" is incorrectly identified as a species name. I think it is caused by the blank space in the cds or protein sequence name you input. PhyloMCL has strict requirements on this aspect (as it is packaged software, I cannot modify it). It is recommended that you try to repair it according to the following process;

  1. First determine whether you have 127 species. Check whether the "- species " and "- tree" parameter files in "step2.sh" correspond to each other. The "- species " file should contain "sequence name t species name" to determine where the "1.000" wrong species name appears.
  2. After determining the input file of the affected species. Consider using "awk '{print $1}'" or the cut command to output only the first section name and sequence of the * fa file. 3 Run tree2gd again to check whether the repair is successful.

If you do not want to repeat the first step of sequence alignment. You need to adjust the blastp sequence name of the "- in" file in "step2.sh", the sequence name and species name in "- species", and the species name in "- tree" to ensure that PhyloMCL runs successfully.

PS. In our experience, we generally pre process all sequence names as ">speciesnames_GeneID", which will facilitate later analysis.

lfp-a commented 1 year ago

Thank you for your help. I'll have a try