SunPengChuan / wgdi

WGDI: A user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes
https://wgdi.readthedocs.io/en/latest/
BSD 2-Clause "Simplified" License
114 stars 22 forks

Problem of preparing the data #47

Closed: hyyuu closed this issue 7 months ago

hyyuu commented 8 months ago

Hello, thank you very much for this program!

I am at the first step, preparing the input, and I have an issue with deal_gff.py. If I am not mistaken, this single script can modify and generate all the required files, without using 0.1.py, 0.2.py and 03.py. However, deal_gff.py returned empty cds and pep files but a complete lens file. Could you please help me with this problem?

All my data files were downloaded from NCBI RefSeq (without any editing). Here is my command:

python deal_gff.py Tigriopus_californicus_GCF_007210705.1_Tcal_SD_v2.1_genomic.gff Tigriopus_californicus_GCF_007210705.1_Tcal_SD_v2.1_cds_from_genomic.cds.fasta Tigriopus_californicus_GCF_007210705.1_Tcal_SD_v2.1_protein.pep.fasta tig1

Thank you very much for your help!

Regards, Alex

hyyuu commented 8 months ago

Hello, I checked the generated lens and gff files further and compared them with the original pep.fasta.

Column 2 of the generated gff file repeats the chromosome name rather than giving the gene ID:

NC_081440.1 tig1_NC_081440.1g00001  6994    8844    -   1   rna-XM_059224180.1
NC_081440.1 tig1_NC_081440.1g00002  13054   26311   +   2   rna-XM_059235080.1

The original gff file:

NC_081440.1 RefSeq  region  1   16497411    .   +   .   ID=NC_081440.1:1..16497411;Dbxref=taxon:6832;Name=1;chromosome=1;collection-date=2012;country=USA: Ocean Beach%2C San Diego%2C California;gbkey=Src;genome=chromosome;isolation-source=water;lat-lon=32.75 N 117.25 W;mol_type=genomic DNA;strain=San Diego
NC_081440.1 Gnomon  gene    6994    8844    .   -   .   ID=gene-LOC131878253;Dbxref=GeneID:131878253;Name=LOC131878253;description=spermine synthase-like;gbkey=Gene;gene=LOC131878253;gene_biotype=protein_coding

The headers of the original pep.fa:

>XP_059078263.1 iroquois-class homeodomain protein IRX-1-like isoform X2 [Tigriopus californicus]
>XP_059078266.1 uncharacterized protein LOC131876794 [Tigriopus californicus]

The header of the original cds.fa:

>lcl|NC_081440.1_cds_XP_059080163.1_1 [gene=LOC131878253] [db_xref=GeneID:131878253] [protein=spermine synthase-like] [protein_id=XP_059080163.1] [location=complement(join(7091..7402,7501..7734,7827..8017,8107..8258,8336..8829))] [gbkey=CDS]

Would this be the cause of the error?

Thank you very much!

Regards, Alex
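
A mismatch like this can be checked before running deal_gff.py by comparing the identifiers in each file directly. The sketch below is not part of WGDI: the helper names are hypothetical, and it assumes NCBI-style headers like the ones quoted above, where the CDS header carries a `[protein_id=...]` tag and the protein header starts with the accession.

```python
import re

# Hypothetical helpers for illustration only; not part of WGDI.
PROTEIN_ID_TAG = re.compile(r"\[protein_id=([^\]]+)\]")


def pep_id(header):
    """Return the accession from a protein FASTA header (first token after '>')."""
    return header.lstrip(">").split()[0]


def cds_protein_id(header):
    """Return the protein_id tag from an NCBI cds_from_genomic header, or None."""
    match = PROTEIN_ID_TAG.search(header)
    return match.group(1) if match else None


def fasta_headers(path):
    """Yield the header lines (those starting with '>') of a FASTA file."""
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                yield line.rstrip("\n")


def shared_ids(pep_headers, cds_headers):
    """Return the set of protein IDs found in both header collections."""
    pep = {pep_id(h) for h in pep_headers}
    cds = {cds_protein_id(h) for h in cds_headers}
    cds.discard(None)  # headers without a protein_id tag contribute nothing
    return pep & cds
```

Calling `shared_ids(fasta_headers(pep_file), fasta_headers(cds_file))` on the two downloaded FASTA files and getting an empty or very small set would suggest that the IDs need to be remapped before the files can be used together.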

SunPengChuan commented 8 months ago

You're right, that's the issue here. The problem is in deal_gff.py. I usually tweak line 22 and throw in an extra line at 59 and 72 to make sure all the IDs match up. But if fiddling with Python code isn't your thing, no worries! Just shoot me over the cds, pep, and gff files, and I'll send you back the fixed-up deal_gff.py. Easy peasy!

hyyuu commented 8 months ago

Thank you for your prompt reply! Sorry that I am not familiar with Python, so it is very kind of you to help me edit the script!
My files were downloaded directly from NCBI; here is the FTP link: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/007/210/705/GCF_007210705.1_Tcal_SD_v2.1/

Thank you very much for your kind help again!

Regards, Alex

SunPengChuan commented 8 months ago

I have updated deal_gff_ncbi.py based on the example you provided and placed it at https://github.com/SunPengChuan/wgdi-example/code. I hope this helps you process your data.

hyyuu commented 8 months ago

Hello!

Thanks for your script and instructions! It works perfectly now. However, I have run into another problem. I am running the -ks option, but it returned the error below.


Traceback (most recent call last):
  File "/home/hiuyan/.pyenv/versions/3.6.12/bin/wgdi", line 11, in <module>
    sys.exit(main())
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/wgdi/run.py", line 163, in main
    module_to_run(arg, value)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/wgdi/run.py", line 122, in module_to_run
    run_subprogram(program, conf, name)
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/wgdi/run.py", line 87, in run_subprogram
    r.run()
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/wgdi/ks.py", line 99, in run
    kaks = self.pair_kaks(['gene1', 'gene2'])
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/wgdi/ks.py", line 111, in pair_kaks
    self.align()
  File "/home/hiuyan/.pyenv/versions/3.6.12/lib/python3.6/site-packages/wgdi/ks.py", line 132, in align
    stdout, stderr = muscle_cline()
  File "/home/hiuyan/.local/lib/python3.6/site-packages/Bio/Application/__init__.py", line 569, in __call__
    raise ApplicationError(return_code, str(self), stdout_str, stderr_str)
Bio.Application.ApplicationError: Non-zero return code 126 from '/home/hiuyan/tools/muscle-5.1.0 -in pair.pep -out prot.aln -seqtype protein -clwstrict', message '/bin/sh: 1: /home/hiuyan/tools/muscle-5.1.0: Permission denied'

I have set up the config file as instructed; here it is. I am using a shared cluster and I do not have root access.

[ini]
mafft_path = /home/share/tools/mafft-7.402-with-extensions/binaries
pal2nal_path = /home/hiuyan/tools/pal2nal.v14/pal2nal.pl
yn00_path = /home/chulab/share/jiaojiao/tools/PAML/paml4.9e/bin/yn00
muscle_path = /home/hiuyan/tools/muscle-5.1.0/src
iqtree_path = /home/hiuyan/fishball/tools/iqtree-1.6.12-Linux/bin
trimal_path = /home/share/tools/trimal-trimAl/source
fasttree_path = /home/sunpc/miniconda3/bin/fasttree
divvier_path = /bin/divvier

I cannot find relevant help online. Could you please help me with this error?

Thank you again!

Regards, Alex
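
A return code of 126 from the shell generally means that the command was found but could not be executed, which is also what happens when the path points to a directory rather than a binary. The sketch below is a minimal check, not part of WGDI: the function name is hypothetical, and it simply reads an `[ini]` section like the one above and reports what each configured path is. Some entries may legitimately be directories depending on how WGDI uses them, so the report is only a starting point.

```python
import configparser
import os


def classify_tool_paths(conf_text, section="ini"):
    """Map each option in the section to 'directory', 'missing',
    'executable', or 'not executable'. Hypothetical helper, not WGDI code."""
    # interpolation=None keeps any '%' characters in paths from being parsed.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    report = {}
    for name, path in parser.items(section):
        if os.path.isdir(path):
            report[name] = "directory"
        elif not os.path.exists(path):
            report[name] = "missing"
        elif os.access(path, os.X_OK):
            report[name] = "executable"
        else:
            report[name] = "not executable"
    return report
```

Passing the text of the config file to `classify_tool_paths` and looking at the `muscle_path` entry would show whether it resolves to a file the current user may execute. On a shared cluster without root access, a "not executable" result can usually be fixed with `chmod u+x` on a copy in the user's own directory.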

SunPengChuan commented 8 months ago

WGDI currently requires MUSCLE 3.8. I've been wanting to upgrade it to support 5.1, but I haven't had the time to do it. I'll make the change when I have time later, but for now, you can use MUSCLE 3.8 or MAFFT.
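
For context, the command in the traceback (`-in ... -out ... -seqtype protein -clwstrict`) uses MUSCLE 3.8 options. As far as the MUSCLE 5 documentation describes, version 5 takes `-align` and `-output` instead and writes aligned FASTA rather than CLUSTAL. The sketch below only illustrates that difference; the function is hypothetical and is not how WGDI builds its command.

```python
def muscle_args(exe, infile, outfile, major_version):
    """Build an argument list for a protein alignment with MUSCLE 3 or 5.

    Hypothetical helper for illustration; not taken from WGDI.
    """
    if major_version >= 5:
        # MUSCLE 5 style (assumed from its documentation); the output is
        # aligned FASTA, so a CLUSTAL-expecting step would need a conversion.
        return [exe, "-align", infile, "-output", outfile]
    # MUSCLE 3.8 style, matching the options seen in the traceback.
    return [exe, "-in", infile, "-out", outfile,
            "-seqtype", "protein", "-clwstrict"]
```

A list built this way can be passed to `subprocess.run(..., check=True)`, which avoids shell quoting issues with paths.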

hyyuu commented 7 months ago

Sorry for the late reply! It works after I changed the path and the version of MUSCLE. Thank you for your kind help again!

Regards, Alex