biocommons / uta

Universal Transcript Archive: comprehensive genome-transcript alignments; multiple transcript sources, versions, and alignment methods; available as a docker image
Apache License 2.0

feat(IPVC-2276): skip over accessions that are not present in seqrepo and log them to a file #257

Closed: sptaylor closed this 7 months ago

sptaylor commented 7 months ago

The exonset file, derived from GFF files, contains transcript accessions that are not present in SeqRepo. The original script raises a RuntimeError in this case; with this PR, the script instead skips transcripts that are missing from SeqRepo and logs them to a file.
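For readers who want the skip-and-log pattern in isolation, here is a minimal sketch. It is not the code in this PR: the function name `collect_sequences`, its parameters, and the use of a plain mapping that raises `KeyError` for unknown accessions as a stand-in for SeqRepo are all assumptions made for illustration. The log messages are modeled on the demo output.

```python
import logging
from pathlib import Path
from typing import Dict, Iterable, List, Mapping, Tuple

logger = logging.getLogger(__name__)


def collect_sequences(
    accessions: Iterable[str],
    seq_store: Mapping[str, str],  # hypothetical stand-in for a SeqRepo handle
    missing_path: str,             # hypothetical output path for skipped accessions
) -> Tuple[Dict[str, str], List[str]]:
    """Look up each accession; skip and record the ones that are absent."""
    found: Dict[str, str] = {}
    missing: List[str] = []
    for ac in accessions:
        try:
            found[ac] = seq_store[ac]
        except KeyError:
            # Previously this situation would abort the run with a RuntimeError.
            logger.warning("Sequence not found: %s", ac)
            missing.append(ac)
    if missing:
        # One accession per line so the file is easy to diff or grep.
        Path(missing_path).write_text("".join(f"{ac}\n" for ac in missing))
        logger.info(
            "%d accessions were not found in Seqrepo. See %s.",
            len(missing),
            missing_path,
        )
    return found, missing
```

Collecting the misses and writing them once at the end keeps the main loop simple and gives a single summary line, at the cost of losing the list if the run is interrupted midway.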

Demo:

2024-04-02 14:42:05 INFO     [__main__] loaded /opt/repos/uta/src/uta/../../etc/global.conf
2024-04-02 14:42:05 INFO     [__main__] opened /workdir/loading/full_dataset.exonsets.gz
2024-04-02 14:42:05 INFO     [__main__] Opened sequence directories: /usr/local/share/seqrepo/2024-02-20
2024-04-02 14:42:05 INFO     [__main__] Writing seqinfo to stdout
2024-04-02 14:45:48 WARNING  [root] Sequence not found: NM_001005170.3
2024-04-02 14:47:48 WARNING  [root] Sequence not found: NM_001105281.5
2024-04-02 14:52:35 WARNING  [root] Sequence not found: NM_001265615.2
2024-04-02 14:53:13 WARNING  [root] Sequence not found: NM_001278392.2
2024-04-02 15:02:51 WARNING  [root] Sequence not found: NM_001363863.2
2024-04-02 15:04:26 WARNING  [root] Sequence not found: NM_001370640.3
2024-04-02 15:07:04 WARNING  [root] Sequence not found: NM_001386206.2
2024-04-02 15:08:13 WARNING  [root] Sequence not found: NM_001394030.1
2024-04-02 15:08:13 WARNING  [root] Sequence not found: NM_001394034.1
2024-04-02 15:08:16 WARNING  [root] Sequence not found: NM_001394148.1
2024-04-02 15:08:16 WARNING  [root] Sequence not found: NM_001394149.1
2024-04-02 15:08:30 WARNING  [root] Sequence not found: NM_001394650.1
2024-04-02 15:08:41 WARNING  [root] Sequence not found: NM_001395215.1
2024-04-02 15:08:41 WARNING  [root] Sequence not found: NM_001395223.1
2024-04-02 15:08:43 WARNING  [root] Sequence not found: NM_001395278.1
2024-04-02 15:08:46 WARNING  [root] Sequence not found: NM_001395417.1
2024-04-02 15:08:46 WARNING  [root] Sequence not found: NM_001395421.1
2024-04-02 15:08:47 WARNING  [root] Sequence not found: NM_001395462.1
2024-04-02 15:08:50 WARNING  [root] Sequence not found: NM_001395637.1
2024-04-02 15:08:52 WARNING  [root] Sequence not found: NM_001395847.1
2024-04-02 15:08:54 WARNING  [root] Sequence not found: NM_001395979.1
2024-04-02 15:08:55 WARNING  [root] Sequence not found: NM_001396036.1
2024-04-02 15:09:30 WARNING  [root] Sequence not found: NM_001401501.1
2024-04-02 15:09:31 WARNING  [root] Sequence not found: NM_001401686.1
2024-04-02 15:09:31 WARNING  [root] Sequence not found: NM_001401690.1
2024-04-02 15:17:28 WARNING  [root] Sequence not found: NM_153699.2
2024-04-02 15:28:45 WARNING  [root] Sequence not found: NR_171775.1
2024-04-02 15:29:16 INFO     [root] 27 accessions were not found in Seqrepo. See /workdir/loading/exonset_missing_accessions.txt.