RiversDong / GageTracker

tool for dating gene age by micro- and macro-synteny with high speed and accuracy

Issue with windowmasker Step in GageTracker Workflow #4

Open tachengtatangi opened 1 week ago

tachengtatangi commented 1 week ago

Hi,

First of all, thank you for developing such an excellent tool! I am encountering an issue during the windowmasker step while processing my target genome. The error message I receive is:

/data/xxx/miniconda2/envs/hgt/lib/python3.8/site-packages/gtfparse/read_gtf.py:82: FutureWarning: The error_bad_lines argument has been deprecated and will be removed in a future version. Use on_bad_lines in the future.

  chunk_iterator = pd.read_csv(
/data/xxx/miniconda2/envs/hgt/lib/python3.8/site-packages/gtfparse/read_gtf.py:82: FutureWarning: The warn_bad_lines argument has been deprecated and will be removed in a future version. Use on_bad_lines in the future.

  chunk_iterator = pd.read_csv(
INFO:root:Extracted GTF attributes: ['gene_id', 'transcript_id', 'db_xref', 'description', 'gbkey', 'gene', 'gene_biotype', 'experiment', 'model_evidence', 'product', 'transcript_biotype', 'exon_number', 'protein_id', 'pseudo', 'exception', 'note', 'deleted', 'inference', 'anticodon', 'transl_except', 'substituted', 'partial', 'part', 'standard_name']
**Error: (Exception::open failed) could not open /data/xxx/Species/CommonTest/NewGenes/Pmes/dating/masking/Pmes.fa.hm.tmp
Error: (Exception::creation failure) could not create a unit counts container
Error: (106.16) Application's execution failed (Exception::creation failure) could not create a unit counts container
computing the genome length
Error: (Exception::creation failure) unrecognized unit counts format
Error: (Exception::creation failure) could not create a unit counts container
Error: (106.16) Application's execution failed (Exception::creation failure) could not create a unit counts container**
FASTA-Reader: Start of first data line in seq is about 61% ambiguous nucleotides (shouldn't be over 40%)
FASTA-Reader: Start of first data line in seq is about 58% ambiguous nucleotides (shouldn't be over 40%)

Following this, I was unable to generate .tmpmask and subsequent files. Additionally, I observed repetitive output totaling around 700 GB, consisting of the following lines:

Use of uninitialized value $s in string at /data/xxx/biosoft/lastz/GageTracker/gene.exon.upper.pl line 42, <List3> line 1.
Use of uninitialized value $string in substr at /data/xxx/biosoft/lastz/GageTracker/gene.exon.upper.pl line 38, <List3> line 1.

Could you please advise on what might be causing this issue? Below is my ctl file configuration. The genome I selected is hard-masked, and the command I ran is: GageTracker Pmes.ctl -p 15

This is the ctl file: Pmes.txt
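Since the genome is hard-masked and the FASTA-Reader warnings in the log above mention a high share of ambiguous nucleotides, it may help to measure how much of each sequence is N before running the pipeline. The following is only a minimal sketch and is not part of GageTracker: the function name `n_fraction_per_sequence` and the demo records are hypothetical, and a high N fraction is not confirmed to be the cause of the windowmasker errors.

```python
import io
from typing import Dict, Iterable, List, Tuple


def n_fraction_per_sequence(lines: Iterable[str]) -> Dict[str, Tuple[int, float]]:
    """Return {sequence_name: (length, fraction_of_N)} for FASTA text lines.

    Hypothetical helper for illustration; both 'N' and 'n' are counted.
    """
    counts: Dict[str, List[int]] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            parts = line[1:].split()
            name = parts[0] if parts else ""
            counts[name] = [0, 0]  # [total bases, N bases]
        elif name is not None:
            counts[name][0] += len(line)
            counts[name][1] += line.upper().count("N")
    return {
        seq: (total, (n_bases / total) if total else 0.0)
        for seq, (total, n_bases) in counts.items()
    }


if __name__ == "__main__":
    # Tiny in-memory FASTA used only as demo data.
    demo = io.StringIO(">chr1 demo\nACGTNNNN\nNNNNACGT\n>chr2\nACGTACGT\n")
    for seq, (length, frac) in n_fraction_per_sequence(demo).items():
        print(f"{seq}\t{length}\t{frac:.2%}")
```

To run it on a real genome, pass an open file handle instead of the demo data, for example `with open("Pmes.fa.hm") as fh: n_fraction_per_sequence(fh)`. The file is read line by line, so memory use stays small even for large assemblies.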

If you need any additional information, please feel free to let me know. I sincerely appreciate your time and assistance. Thank you again!

Best regards,

RiversDong commented 1 week ago

Thanks for your question. Could you kindly share Pmes.fa.hm and Lafr.fa.hm with me so that I can debug GageTracker?

tachengtatangi commented 1 week ago

Thank you very much for your prompt response. Due to file size limitations on GitHub, may I send these two genomes to your email instead? I noticed your email address (chuand@whu.edu.cn) in another issue—would it be alright if I send the files there?

RiversDong commented 1 week ago

If the files are too large, it might be inconvenient to send them by email. I think it would be better to share the data on Google Drive and send me the link, so I can download it myself.

tachengtatangi commented 1 week ago

Thanks for your reminder. I’ve uploaded the two genomes to Google Drive, and the link is https://drive.google.com/drive/folders/1O-KQjGf4YLuTm0ToG7s8mB8KAx9HhyX-?usp=sharing