ComparativeGenomicsToolkit / Comparative-Annotation-Toolkit

Apache License 2.0

assert interval.data not in ref_interval_map #248

Open chatla01 opened 3 years ago

chatla01 commented 3 years ago

Hi,

I have had some successful CAT runs, but when I recently tried with new genomes I got the following error.

ERROR: 2021-02-18 22:17:05,196 - [pid 14862] Worker Worker(salt=869007839, workers=10, host=n0093.savio2, username=kchatla, pid=9220) failed    Task: EvaluateTransMapDriverTask for dimm
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/luigi/worker.py", line 191, in run
    new_deps = self._run_get_new_deps()
  File "/usr/local/lib/python3.8/dist-packages/luigi/worker.py", line 133, in _run_get_new_deps
    task_gen = self.task.run()
  File "/usr/local/lib/python3.8/dist-packages/cat-2.0-py3.8.egg/cat/__init__.py", line 1380, in run
    results = transmap_classify(self.tm_eval_args)
  File "/usr/local/lib/python3.8/dist-packages/cat-2.0-py3.8.egg/cat/transmap_classify.py", line 46, in transmap_classify
    synteny_scores = synteny(ref_gp_dict, gp_dict)
  File "/usr/local/lib/python3.8/dist-packages/cat-2.0-py3.8.egg/cat/transmap_classify.py", line 182, in synteny
    ref_interval_map = make_ref_interval_map(ref_chrom_intervals)
  File "/usr/local/lib/python3.8/dist-packages/cat-2.0-py3.8.egg/cat/transmap_classify.py", line 173, in make_ref_interval_map
    assert interval.data not in ref_interval_map
AssertionError

Log file attached: dimm_log.txt

Thank you in advance. Chatla

ifiddes commented 3 years ago

Hi,

Just got your email, sorry for not noticing this before. This is a new error. Can you send me the following two files:

  1. workDir/transMap/dimm.filtered.gp
  2. workDir/reference/.gp

Thanks

ifiddes commented 3 years ago

To be honest, I am surprised this issue has never been hit before. The synteny classifier expects the gene_id of each gene to be unique, so that it can compare synteny between the source genome and the target genome. That assumption breaks for your tRNAs, whose gene names are not unique. I could probably bypass this check, but I am not sure how things downstream will behave when gene identifiers are shared like this. It is probably best to fix the source file so that gene_id values are unique to individual loci. I am going to modify the validate_gff3 script to throw an error in these cases.

Can you send me your input GFF3 file so I can verify that my hypothesis is correct, and that the changes to the validator script work? I will also send you back a fixed version of the input GFF3.
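For readers who want to check their own annotation for this condition before running CAT, here is a minimal sketch. It is not the validate_gff3 script and not CAT's code; the function names `parse_attributes` and `find_shared_gene_ids` are hypothetical, and it assumes that gene records have `gene` in column 3 and carry a `gene_id` attribute in column 9.

```python
from collections import defaultdict


def parse_attributes(field):
    """Parse a GFF3 column-9 string such as 'ID=g1;gene_id=trnA' into a dict."""
    attrs = {}
    for part in field.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_shared_gene_ids(gff3_lines, id_key="gene_id"):
    """Return {gene_id: [(chrom, start, end), ...]} for ids seen at more than one locus.

    Assumption: every 'gene' row is one locus, so an id that appears on
    two or more 'gene' rows is shared between loci.
    """
    loci = defaultdict(list)
    for line in gff3_lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Only well-formed 9-column rows describing a gene are considered.
        if len(cols) != 9 or cols[2] != "gene":
            continue
        gene_id = parse_attributes(cols[8]).get(id_key)
        if gene_id is not None:
            loci[gene_id].append((cols[0], int(cols[3]), int(cols[4])))
    return {gene_id: hits for gene_id, hits in loci.items() if len(hits) > 1}
```

Passing an open file handle (or any iterable of lines) to `find_shared_gene_ids` yields the identifiers that would need to be made unique; an empty dict means no gene_id is reused across gene rows.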

ifiddes commented 3 years ago

Hello,

I just made a new branch, fix/disjoint_chrom, in PR #252. This branch adds a requirement that each gene be on a single chromosome, which will fix the crash you ran into. There is also a new fixer script for GFF3 files in programs/fix_chrom_disjoint_genes. This script will produce valid GFF3 files.

What this script cannot do is fix genes that are disjoint on the same chromosome. This is due to a limitation of the genePred file format. In your annotation files, I did detect a few instances of these, but I don't think they will cause a huge problem here. The updated validate_gff3 script will now warn about such genes.
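The two cases described above can be told apart with a short sketch. This is not the fixer or validator script; `classify_disjoint_genes` is a hypothetical helper, and it assumes that "disjoint on the same chromosome" means a gene's transcripts do not merge into one contiguous span. Input records are plain `(gene_id, chrom, start, end)` tuples, one per transcript.

```python
from collections import defaultdict


def classify_disjoint_genes(transcripts):
    """Split problem genes into two sorted lists of gene ids.

    transcripts: iterable of (gene_id, chrom, start, end), one per transcript.
    Returns (multi_chrom, same_chrom_disjoint).
    """
    by_gene = defaultdict(list)
    for gene_id, chrom, start, end in transcripts:
        by_gene[gene_id].append((chrom, start, end))

    multi_chrom, same_chrom_disjoint = [], []
    for gene_id, records in by_gene.items():
        chroms = {chrom for chrom, _, _ in records}
        if len(chroms) > 1:
            # Transcripts of one gene sit on different chromosomes.
            multi_chrom.append(gene_id)
            continue
        # Single chromosome: sweep the spans in start order and look for a gap.
        spans = sorted((start, end) for _, start, end in records)
        reach = spans[0][1]
        for start, end in spans[1:]:
            if start > reach:
                same_chrom_disjoint.append(gene_id)
                break
            reach = max(reach, end)
    return sorted(multi_chrom), sorted(same_chrom_disjoint)
```

Under these assumptions, genes in the first list correspond to the case the fixer script addresses, while genes in the second list correspond to the case that the validator only warns about.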

You will need to restart your CAT run from the beginning with these new GFF3 files. You can retain the chain files in workdir/chaining to reduce compute time.

chatla01 commented 3 years ago

Hi Ian,

Thank you, it worked.