cgat-developers / cgat-apps


Wrong columns in `--sanitize="ucsc" --assembly-report=FILE`? #115

Open IanSudbery opened 2 years ago

IanSudbery commented 2 years ago

`gff2gff --method=sanitize --assembly-report=FILE` has the wrong default columns, as far as I can tell. Counting from 0, the Ensembl name is in column 0 and the UCSC name in column 9. Why does gff2gff use column 4 by default? That column contains the GenBank-Accn:

# Sequence-Name Sequence-Role   Assigned-Molecule       Assigned-Molecule-Location/Type GenBank-Accn    Relationship    RefSeq-Accn     Assembly-Unit   Sequence-Length UCSC-style-name
1       assembled-molecule      1       Chromosome      CM000663.2      =       NC_000001.11    Primary Assembly        248956422       chr1
2       assembled-molecule      2       Chromosome      CM000664.2      =       NC_000002.12    Primary Assembly        242193529       chr2
3       assembled-molecule      3       Chromosome      CM000665.2      =       NC_000003.12    Primary Assembly        198295559       chr3
4       assembled-molecule      4       Chromosome      CM000666.2      =       NC_000004.12    Primary Assembly        190214555       chr4
5       assembled-molecule      5       Chromosome      CM000667.2      =       NC_000005.10    Primary Assembly        181538259       chr5
6       assembled-molecule      6       Chromosome      CM000668.2      =       NC_000006.12    Primary Assembly        170805979       chr6
7       assembled-molecule      7       Chromosome      CM000669.2      =       NC_000007.14    Primary Assembly        159345973       chr7
8       assembled-molecule      8       Chromosome      CM000670.2      =       NC_000008.11    Primary Assembly        145138636       chr8
9       assembled-molecule      9       Chromosome      CM000671.2      =       NC_000009.12    Primary Assembly        138394717       chr9
10      assembled-molecule      10      Chromosome      CM000672.2      =       NC_000010.11    Primary Assembly        133797422       chr10
11      assembled-molecule      11      Chromosome      CM000673.2      =       NC_000011.10    Primary Assembly        135086622       chr11
12      assembled-molecule      12      Chromosome      CM000674.2      =       NC_000012.12    Primary Assembly        133275309       chr12
13      assembled-molecule      13      Chromosome      CM000675.2      =       NC_000013.11    Primary Assembly        114364328       chr13
14      assembled-molecule      14      Chromosome      CM000676.2      =       NC_000014.9     Primary Assembly        107043718       chr14
15      assembled-molecule      15      Chromosome      CM000677.2      =       NC_000015.10    Primary Assembly        101991189       chr15
16      assembled-molecule      16      Chromosome      CM000678.2      =       NC_000016.10    Primary Assembly        90338345        chr16
17      assembled-molecule      17      Chromosome      CM000679.2      =       NC_000017.11    Primary Assembly        83257441        chr17
18      assembled-molecule      18      Chromosome      CM000680.2      =       NC_000018.10    Primary Assembly        80373285        chr18
19      assembled-molecule      19      Chromosome      CM000681.2      =       NC_000019.10    Primary Assembly        58617616        chr19
20      assembled-molecule      20      Chromosome      CM000682.2      =       NC_000020.11    Primary Assembly        64444167        chr20
21      assembled-molecule      21      Chromosome      CM000683.2      =       NC_000021.9     Primary Assembly        46709983        chr21
22      assembled-molecule      22      Chromosome      CM000684.2      =       NC_000022.11    Primary Assembly        50818468        chr22
X       assembled-molecule      X       Chromosome      CM000685.2      =       NC_000023.11    Primary Assembly        156040895       chrX
Y       assembled-molecule      Y       Chromosome      CM000686.2      =       NC_000024.10    Primary Assembly        57227415        chrY

This works for human/mouse because we first copy the content of column 0 to column 4 for lines with "assembled-molecule", but it doesn't work for genomes without assembled molecules.
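To make the column question concrete, here is a minimal sketch of building a contig-name map from an assembly report, keyed either on column 0 (Sequence-Name) or column 4 (GenBank-Accn). It is not the gff2gff implementation: `build_contig_map` is a hypothetical helper, and the `scaffold_1` row is invented purely to illustrate a sequence that is not an assembled molecule.

```python
import csv
import io

# Two rows copied from the report above, plus one HYPOTHETICAL unplaced
# scaffold row (invented here only to illustrate the non-assembled case).
REPORT = (
    "# Sequence-Name\tSequence-Role\tAssigned-Molecule\t"
    "Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t"
    "RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name\n"
    "1\tassembled-molecule\t1\tChromosome\tCM000663.2\t=\tNC_000001.11\t"
    "Primary Assembly\t248956422\tchr1\n"
    "X\tassembled-molecule\tX\tChromosome\tCM000685.2\t=\tNC_000023.11\t"
    "Primary Assembly\t156040895\tchrX\n"
    "scaffold_1\tunplaced-scaffold\tna\tna\tXX000001.1\t=\tna\t"
    "Primary Assembly\t1000\tchrUn_XX000001v1\n"
)


def build_contig_map(handle, key_col=0, value_col=9):
    """Map column key_col to column value_col, skipping '#' comment lines.

    Hypothetical helper for illustration; column indices are 0-based.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        mapping[row[key_col]] = row[value_col]
    return mapping


# Keyed on Sequence-Name (column 0): every row resolves by its Ensembl-style name.
by_name = build_contig_map(io.StringIO(REPORT))

# Keyed on GenBank-Accn (column 4): names such as "1" or "scaffold_1" are absent.
by_genbank = build_contig_map(io.StringIO(REPORT), key_col=4)
```

With these inputs, `by_name` resolves both `"1"` and `"scaffold_1"`, whereas `by_genbank` only resolves accessions such as `"CM000663.2"`, which is the mismatch described above when nothing is copied from column 0 into column 4.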