Universal Transcript Archive: comprehensive genome-transcript alignments; multiple transcript sources, versions, and alignment methods; available as a docker image
Apache License 2.0
feat(IPVC-2276): skip over accessions that are not present in seqrepo and log them to a file #257
The exonset file, derived from GFF files, contains transcript accessions that are not present in SeqRepo. The original script raises a RuntimeError in this case; this PR instead skips transcripts that are missing from SeqRepo and logs them to a file.
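A rough sketch of the skip-and-log behavior, for illustration only and not the actual diff. The names `iter_seqinfo`, `write_missing`, and `fetch_sequence` are hypothetical, and the sketch assumes the sequence lookup raises `KeyError` for an unknown accession:

```python
import logging
from typing import Callable, Iterable, Iterator, List, Tuple

logger = logging.getLogger(__name__)


def iter_seqinfo(
    accessions: Iterable[str],
    fetch_sequence: Callable[[str], str],  # hypothetical lookup; assumed to raise KeyError when missing
    missing: List[str],
) -> Iterator[Tuple[str, int]]:
    """Yield (accession, sequence length), skipping accessions that cannot be fetched."""
    for ac in accessions:
        try:
            seq = fetch_sequence(ac)
        except KeyError:
            # Previously this case would abort the run; here we warn, record, and move on.
            logger.warning("Sequence not found: %s", ac)
            missing.append(ac)
            continue
        yield ac, len(seq)


def write_missing(missing: List[str], path: str) -> None:
    """Write one skipped accession per line and log a summary."""
    with open(path, "w") as fh:
        for ac in missing:
            fh.write(ac + "\n")
    logger.info("%d accessions were not found. See %s.", len(missing), path)
```

With a plain dict standing in for the sequence store, `list(iter_seqinfo(["NM_A.1", "NM_B.1"], {"NM_A.1": "ACGT"}.__getitem__, missing))` yields only the first accession and leaves `NM_B.1` in `missing`, which `write_missing` then writes out.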
Demo:
2024-04-02 14:42:05 INFO [__main__] loaded /opt/repos/uta/src/uta/../../etc/global.conf
2024-04-02 14:42:05 INFO [__main__] opened /workdir/loading/full_dataset.exonsets.gz
2024-04-02 14:42:05 INFO [__main__] Opened sequence directories: /usr/local/share/seqrepo/2024-02-20
2024-04-02 14:42:05 INFO [__main__] Writing seqinfo to stdout
2024-04-02 14:45:48 WARNING [root] Sequence not found: NM_001005170.3
2024-04-02 14:47:48 WARNING [root] Sequence not found: NM_001105281.5
2024-04-02 14:52:35 WARNING [root] Sequence not found: NM_001265615.2
2024-04-02 14:53:13 WARNING [root] Sequence not found: NM_001278392.2
2024-04-02 15:02:51 WARNING [root] Sequence not found: NM_001363863.2
2024-04-02 15:04:26 WARNING [root] Sequence not found: NM_001370640.3
2024-04-02 15:07:04 WARNING [root] Sequence not found: NM_001386206.2
2024-04-02 15:08:13 WARNING [root] Sequence not found: NM_001394030.1
2024-04-02 15:08:13 WARNING [root] Sequence not found: NM_001394034.1
2024-04-02 15:08:16 WARNING [root] Sequence not found: NM_001394148.1
2024-04-02 15:08:16 WARNING [root] Sequence not found: NM_001394149.1
2024-04-02 15:08:30 WARNING [root] Sequence not found: NM_001394650.1
2024-04-02 15:08:41 WARNING [root] Sequence not found: NM_001395215.1
2024-04-02 15:08:41 WARNING [root] Sequence not found: NM_001395223.1
2024-04-02 15:08:43 WARNING [root] Sequence not found: NM_001395278.1
2024-04-02 15:08:46 WARNING [root] Sequence not found: NM_001395417.1
2024-04-02 15:08:46 WARNING [root] Sequence not found: NM_001395421.1
2024-04-02 15:08:47 WARNING [root] Sequence not found: NM_001395462.1
2024-04-02 15:08:50 WARNING [root] Sequence not found: NM_001395637.1
2024-04-02 15:08:52 WARNING [root] Sequence not found: NM_001395847.1
2024-04-02 15:08:54 WARNING [root] Sequence not found: NM_001395979.1
2024-04-02 15:08:55 WARNING [root] Sequence not found: NM_001396036.1
2024-04-02 15:09:30 WARNING [root] Sequence not found: NM_001401501.1
2024-04-02 15:09:31 WARNING [root] Sequence not found: NM_001401686.1
2024-04-02 15:09:31 WARNING [root] Sequence not found: NM_001401690.1
2024-04-02 15:17:28 WARNING [root] Sequence not found: NM_153699.2
2024-04-02 15:28:45 WARNING [root] Sequence not found: NR_171775.1
2024-04-02 15:29:16 INFO [root] 27 accessions were not found in Seqrepo. See /workdir/loading/exonset_missing_accessions.txt.