NCBI-Hackathons / Master_gff3_parser

Convert sequence IDs between ucsc/refseq/genbank
MIT License

Semi-successful test converting UCSC cat 4 to RefSeq IDs #10

Open childers opened 7 years ago

childers commented 7 years ago
$ time  seqconv convert --ref felCat4 --out rs cat_felCat4_UCSC_2008.gtf >test_cat_4.rs.gtf
Converting from None to rsStarting Conversion
Cannot convert id: chrM
No corresponding id for chrX from None
FORMAT detected: uc
real    0m24.379s
user    0m1.858s
sys     0m0.188s
childers commented 7 years ago

The resulting GTF contains only a link to the assembly report:

$ wc -l cat_felCat4_UCSC_2008.gtf 
    1000 cat_felCat4_UCSC_2008.gtf
$ wc -l test_cat_4.rs.gtf 
       1 test_cat_4.rs.gtf

$ cat test_cat_4.rs.gtf 
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/003/115/GCA_000003115.1_catChrV17e/GCA_000003115.1_catChrV17e_assembly_report.txt
childers commented 7 years ago

The assembly_report file names the chromosome 'X' in its first column, while the UCSC GTF file uses the format 'chrX'.

Assembly_report

F1      assembled-molecule      F1      Chromosome      CM000711.1      <>      na      Primary Assembly        92851383        chrF1
F2      assembled-molecule      F2      Chromosome      CM000712.1      <>      na      Primary Assembly        81418843        chrF2
X       assembled-molecule      X       Chromosome      CM000713.1      <>      na      Primary Assembly        145558876       chrX
chrUn1_1        unplaced-scaffold       na      na      ACBE01511744.1  <>      na      Primary Assembly        3005    chrUn_ACBE01511744
chrUn1_3106     unplaced-scaffold       na      na      ACBE01511745.1  <>      na      Primary Assembly        3953    chrUn_ACBE01511745
chrUn1_7159     unplaced-scaffold       na      na      ACBE01511746.1  <>      na      Primary Assembly        1488    chrUn_ACBE01511746
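
A minimal sketch of how rows like the ones above could be indexed by their UCSC-style name. It assumes the standard tab-separated, 10-column NCBI assembly report layout, where column 5 is the GenBank accession, column 7 the RefSeq accession, and column 10 the UCSC-style name. `build_ucsc_map` is a hypothetical helper for illustration, not part of seqconv. Note that in the rows shown the RefSeq column is `na`, so a lookup for `chrX` yields a GenBank accession but no RefSeq one.

```python
# Column positions in an NCBI assembly report (0-based); assumed layout.
GENBANK_COL, REFSEQ_COL, UCSC_COL = 4, 6, 9


def _clean(value):
    """Treat the report's 'na' placeholder as a missing value."""
    value = value.strip()
    return None if value == "na" else value


def build_ucsc_map(report_text):
    """Map each UCSC-style name to a (genbank, refseq) accession pair."""
    mapping = {}
    for line in report_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the commented header
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # not a full data row
        ucsc = _clean(fields[UCSC_COL])
        if ucsc is None:
            continue  # no UCSC-style name to index by
        mapping[ucsc] = (_clean(fields[GENBANK_COL]), _clean(fields[REFSEQ_COL]))
    return mapping


# Two rows copied from the report excerpt above, with tabs restored.
REPORT = "\n".join([
    "X\tassembled-molecule\tX\tChromosome\tCM000713.1\t<>\tna\tPrimary Assembly\t145558876\tchrX",
    "chrUn1_1\tunplaced-scaffold\tna\tna\tACBE01511744.1\t<>\tna\tPrimary Assembly\t3005\tchrUn_ACBE01511744",
])

ids = build_ucsc_map(REPORT)
print(ids["chrX"])  # ('CM000713.1', None)
```

Keying on the last column rather than the first sidesteps the 'X' versus 'chrX' mismatch, but only for reports that actually fill in the UCSC-style name.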

Cat GTF

$ head cat_felCat4_UCSC_2008.gtf
chrM    felCat4_gold    exon    1       17009   0.000000        +       .       gene_id "NC_001700"; transcript_id "NC_001700"; 
chrX    felCat4_gold    exon    1       3694    0.000000        +       .       gene_id "ACBE01484836.1"; transcript_id "ACBE01484836.1"; 
chrX    felCat4_gold    exon    16589   17861   0.000000        -       .       gene_id "ACBE01484837.1"; transcript_id "ACBE01484837.1"; 
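
One possible way to handle IDs that have no counterpart, sketched under stated assumptions: rewrite the first column where a mapping exists, pass other lines through unchanged, and report the unmapped IDs at the end. `convert_gtf_lines` and the `id_map` dictionary are hypothetical names for illustration. This is not how seqconv currently behaves, and keeping unmapped lines is a policy choice (dropping them would be another).

```python
def convert_gtf_lines(lines, id_map):
    """Return (converted_lines, unmapped_ids) for tab-separated GTF lines.

    Lines whose sequence ID is missing from id_map are kept unchanged.
    """
    converted = []
    unmapped = set()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            converted.append(line)  # keep comments and blank lines as-is
            continue
        seqid, sep, rest = line.partition("\t")
        new_id = id_map.get(seqid)
        if new_id is None:
            unmapped.add(seqid)
            converted.append(line)
        else:
            converted.append(new_id + sep + rest)
    return converted, sorted(unmapped)


# Example input modelled on the GTF excerpt above, with tabs restored.
gtf = [
    'chrM\tfelCat4_gold\texon\t1\t17009\t0.000000\t+\t.\tgene_id "NC_001700"; transcript_id "NC_001700";',
    'chrX\tfelCat4_gold\texon\t1\t3694\t0.000000\t+\t.\tgene_id "ACBE01484836.1"; transcript_id "ACBE01484836.1";',
]
# Assumed mapping: chrX to the GenBank accession listed in the assembly report.
id_map = {"chrX": "CM000713.1"}

out, missing = convert_gtf_lines(gtf, id_map)
print(missing)  # ['chrM']
```

With this approach the output keeps the same number of lines as the input, so a partially convertible file would not collapse to a single line.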
childers commented 7 years ago

@guilhemfaure How should we handle this case?