Kuanhao-Chao / LiftOn

🚀 LiftOn: Accurate annotation mapping for GFF/GTF across assemblies
http://ccb.jhu.edu/lifton
GNU General Public License v3.0

gffutils database build failed with UNIQUE constraint failed: features.id #12

Open yeeus opened 3 weeks ago

yeeus commented 3 weeks ago

Useful and exciting tool! However, when I ran LiftOn with this command:

lifton MF2_mat.v1.0.fa ~/rawdata/GRCh38/ref/GRCh38.p14.new_name.fa -sc 0.95 -copies -g ~/rawdata/GRCh38/ref/GCF_000001405.40_GRCh38.p14_genomic.gff -polish -o CN1v1.0_mat.lifton.gff -c -cds -ad RefSeq -f type.list -exclude_partial -t 10 -D

I got this error:

**********************
** Running miniprot **
**********************
gffutils database build failed with UNIQUE constraint failed: features.id
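For context, the text of this error is the message SQLite raises when a second row is inserted with a key that already exists. The sketch below reproduces that message with Python's standard `sqlite3` module. The two-column `features` table is a hypothetical, simplified stand-in (assumed here to be keyed on `id`, as the message suggests), not the actual schema or code path used by gffutils or LiftOn.

```python
import sqlite3

# Hypothetical, simplified stand-in for a features table keyed on id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (id TEXT PRIMARY KEY, featuretype TEXT)")

# First insert succeeds.
conn.execute("INSERT INTO features VALUES (?, ?)", ("cds-NP_001254479.2", "CDS"))

# Second insert reuses the same id and violates the key constraint.
try:
    conn.execute("INSERT INTO features VALUES (?, ?)", ("cds-NP_001254479.2", "CDS"))
    error_message = None
except sqlite3.IntegrityError as exc:
    error_message = str(exc)

print(error_message)
conn.close()
```

If the failure here has the same cause, at least two records in one of the input files resolve to the same feature id when the database is built.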

The log also contains many warnings, followed by a ValueError:

$tail -50 CN1v1.0/Mat/06.lifton/GRCh38/lifton.sh.sbatch.e
2024-07-01 15:57:42,479 - WARNING - Duplicate lines in file for id '4c10942085bce244cfce502d028bd6f1'; ignoring all but the first
2024-07-01 15:57:42,479 - WARNING - Duplicate lines in file for id '7ab0878db5cb4b1d34b527f6f36432d5'; ignoring all but the first
2024-07-01 15:57:42,479 - WARNING - Duplicate lines in file for id '7e9c75fdd3787d0324c845de6e12c07e'; ignoring all but the first
2024-07-01 15:57:42,479 - WARNING - Duplicate lines in file for id 'dff13391ae5c98698f19335853b321e5'; ignoring all but the first
2024-07-01 15:57:42,479 - WARNING - Duplicate lines in file for id '4c10942085bce244cfce502d028bd6f1'; ignoring all but the first
2024-07-01 15:57:42,480 - WARNING - Duplicate lines in file for id '7ab0878db5cb4b1d34b527f6f36432d5'; ignoring all but the first
2024-07-01 15:57:42,480 - WARNING - Duplicate lines in file for id '7e9c75fdd3787d0324c845de6e12c07e'; ignoring all but the first
2024-07-01 15:57:42,480 - WARNING - Duplicate lines in file for id 'dff13391ae5c98698f19335853b321e5'; ignoring all but the first
2024-07-01 15:57:42,480 - WARNING - Duplicate lines in file for id '4c10942085bce244cfce502d028bd6f1'; ignoring all but the first
2024-07-01 15:57:42,480 - WARNING - Duplicate lines in file for id '7ab0878db5cb4b1d34b527f6f36432d5'; ignoring all but the first
2024-07-01 15:57:42,481 - WARNING - Duplicate lines in file for id '7e9c75fdd3787d0324c845de6e12c07e'; ignoring all but the first
2024-07-01 15:57:42,481 - WARNING - Duplicate lines in file for id 'dff13391ae5c98698f19335853b321e5'; ignoring all but the first
2024-07-01 15:57:44,131 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,131 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,131 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,131 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,132 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,132 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,132 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:44,132 - WARNING - Duplicate lines in file for id 'b29b1fe6fd2e6d005c096993a10f019e'; ignoring all but the first
2024-07-01 15:57:45,000 - INFO - Populating features table and first-order relations: 3903620 features
2024-07-01 15:57:45,001 - INFO - Updating relations
2024-07-01 15:58:19,940 - INFO - Creating relations(parent) index
2024-07-01 15:58:23,229 - INFO - Creating relations(child) index
2024-07-01 15:58:27,449 - INFO - Creating features(featuretype) index
2024-07-01 15:58:30,206 - INFO - Creating features (seqid, start, end) index
2024-07-01 15:58:33,525 - INFO - Creating features (seqid, start, end, strand) index
2024-07-01 15:58:37,309 - INFO - Running ANALYZE features
>> Creating miniprot annotation database : ./lifton_output/miniprot/miniprot.gff3
2024-07-01 15:58:39,206 - INFO - Populating features
2024-07-01 16:00:27,613 - INFO - Populating features table and first-order relations: 1912405 features
2024-07-01 16:00:27,613 - INFO - Updating relations
2024-07-01 16:00:37,349 - INFO - Creating relations(parent) index
2024-07-01 16:00:37,940 - INFO - Creating relations(child) index
2024-07-01 16:00:38,686 - INFO - Creating features(featuretype) index
2024-07-01 16:00:39,703 - INFO - Creating features (seqid, start, end) index
2024-07-01 16:00:41,177 - INFO - Creating features (seqid, start, end, strand) index
2024-07-01 16:00:42,823 - INFO - Running ANALYZE features
Traceback (most recent call last):
  File "/slurm/home/zju/zhanglab/chenquanyu/mambaforge/envs/liftoff/bin/lifton", line 8, in <module>
    sys.exit(main())
  File "/slurm/home/zju/zhanglab/chenquanyu/mambaforge/envs/liftoff/lib/python3.10/site-packages/lifton/lifton.py", line 352, in main
    run_all_lifton_steps(args)
  File "/slurm/home/zju/zhanglab/chenquanyu/mambaforge/envs/liftoff/lib/python3.10/site-packages/lifton/lifton.py", line 290, in run_all_lifton_steps
    tree_dict = intervals.initialize_interval_tree(l_feature_db, features)
  File "/slurm/home/zju/zhanglab/chenquanyu/mambaforge/envs/liftoff/lib/python3.10/site-packages/lifton/intervals.py", line 12, in initialize_interval_tree
    tree_dict[chromosome].add(gene_interval)
  File "/slurm/home/zju/zhanglab/chenquanyu/mambaforge/envs/liftoff/lib/python3.10/site-packages/intervaltree/intervaltree.py", line 324, in add
    raise ValueError(
ValueError: IntervalTree: Null Interval objects not allowed in IntervalTree: Interval(45020029, 45020029, 'CDS_51812')
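The interval in this ValueError has the same begin and end (45020029), and intervaltree treats any interval whose begin is not smaller than its end as null. As a rough way to look for candidates, the sketch below scans GFF lines for features whose start and end columns are equal. It assumes the interval is built directly from those two columns, which is a guess about how LiftOn constructs it. The function name and the sample records are hypothetical.

```python
def find_zero_length_candidates(gff_lines):
    """Return (seqid, type, start, end, attributes) for features with start >= end."""
    hits = []
    for line in gff_lines:
        # Skip blank lines, comments and directives.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF record
        start, end = int(fields[3]), int(fields[4])
        if start >= end:
            hits.append((fields[0], fields[2], start, end, fields[8]))
    return hits


# Hypothetical records for illustration only.
sample = [
    "##gff-version 3",
    "chr1\tsrc\tCDS\t100\t200\t.\t+\t0\tID=CDS_1",
    "chr1\tsrc\tCDS\t300\t300\t.\t+\t0\tID=CDS_2",
]
candidates = find_zero_length_candidates(sample)
print(candidates)
```

To check a real file, pass an open file handle instead of the sample list, for example the `miniprot.gff3` file named in the log above.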

When I looked at the GFF file I provided, which was downloaded from NCBI (GRCh38 RefSeq), I found several IDs that are each shared by many lines. These may be causing the error in the miniprot step (Liftoff, by contrast, created unique IDs):

$rg -v '^#' ~/rawdata/GRCh38/ref/GCF_000001405.40_GRCh38.p14_genomic.gff | cut -f 9 | awk -F '[=|;]' '{print $2}' | sort | uniq -c | sort -nr | head
362 cds-NP_001254479.2
358 cds-XP_016860308.1
335 cds-XP_016860310.1
335 cds-XP_016860309.1
316 cds-XP_047301616.1
312 cds-XP_047301617.1
312 cds-NP_001243779.1
311 cds-NP_596869.4
309 cds-XP_024308863.1
299 cds-XP_047301619.1
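The same count can be sketched in Python by reading the `ID` attribute explicitly instead of taking the second `=`/`;`-delimited token, which only works when `ID` is the first attribute. The function name and sample records below are hypothetical. Note that GFF3 allows one feature, such as a multi-exon CDS, to span several lines that share an ID, so repeated CDS IDs are not necessarily an error in the file itself.

```python
from collections import Counter


def count_feature_ids(gff_lines):
    """Count how many GFF records carry each ID attribute value."""
    counts = Counter()
    for line in gff_lines:
        # Skip blank lines, comments and directives.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GFF record
        for attribute in fields[8].split(";"):
            key, sep, value = attribute.partition("=")
            if sep and key.strip() == "ID":
                counts[value] += 1
                break
    return counts


# Hypothetical records for illustration only.
sample = [
    "chr1\tsrc\tCDS\t100\t200\t.\t+\t0\tID=cds-A;Parent=rna-A",
    "chr1\tsrc\tCDS\t300\t400\t.\t+\t2\tID=cds-A;Parent=rna-A",
    "chr1\tsrc\tmRNA\t100\t400\t.\t+\t.\tID=rna-A;Parent=gene-A",
]
id_counts = count_feature_ids(sample)
for feature_id, n in id_counts.most_common(10):
    print(n, feature_id)
```

To run it on a real annotation, pass an open file handle for the GFF file in place of the sample list.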

So I think the miniprot step may need to be adjusted to handle these repeated IDs. Best wishes!