Some RefSeq transcripts are incorrectly filtered out because the regex applied to their `stable_id` is too strict.
Currently, the following `stable_id` values are available from RefSeq transcripts:
Script to print transcript's `stable_id` from cache
You can use the following script to get the transcript's `stable_id`:
```perl
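#!/usr/bin/env perl
use strict;
use warnings;
use Storable qw(fd_retrieve);

# Print the stable_id of every MT transcript in each gzipped cache file given as an argument
foreach my $f (@ARGV) {
  # Stream the decompressed file; the list form of open avoids shell quoting problems
  open(my $fh, '-|', 'gzip', '-dc', $f) or die("ERROR: cannot read $f: $!");
  my $obj = fd_retrieve($fh);
  print $_->{stable_id}, "\n" for @{$obj->{MT}};
  close($fh);
}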
```
This PR fixes the regex so that transcripts with a `stable_id` such as ND4L or ND1 are no longer filtered out (`compmerge` transcripts are still ignored).
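To illustrate the kind of failure described here, the following is a minimal sketch only. Both patterns and the `keep_transcript` helper are hypothetical and are not the regexes used in VEP. The sketch assumes that typical RefSeq IDs look like accessions (for example `NM_001005484.2`), while the mitochondrial transcripts use gene-symbol IDs such as ND1.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical patterns for illustration only; NOT the actual VEP regexes.
my $strict  = qr/^[NX][MR]_\d+/;                      # accession-style IDs only
my $relaxed = qr/^(?:[NX][MR]_\d+|[A-Z][A-Z0-9]+$)/;  # also accepts gene-symbol IDs

# Return 1 if the stable_id matches the given pattern, 0 otherwise.
sub keep_transcript {
  my ($stable_id, $pattern) = @_;
  return $stable_id =~ $pattern ? 1 : 0;
}

# Compare how both patterns treat an accession-style ID and two gene-symbol IDs.
for my $id (qw(NM_001005484.2 ND1 ND4L)) {
  printf "%-16s strict=%d relaxed=%d\n",
    $id, keep_transcript($id, $strict), keep_transcript($id, $relaxed);
}
```

Under these assumptions, the strict pattern rejects ND1 and ND4L because they carry no accession prefix, while the relaxed pattern keeps them.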
Changelog
Fix regex to capture all expected mitochondrial genes
Simplify the filtering for `merged` and `refseq` caches to make it more concise and easier to understand
Testing
Run VEP with and without the hidden flag `--all_refseq` and check that only the `compmerge` RefSeq transcripts are filtered out when using the `refseq` and `merged` caches. All other transcripts should be present.
Check that ND4L and ND1 to ND6 are present in the results.
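The presence check could be automated with a small helper along these lines. This is a sketch, not part of the PR: the `missing_mt_ids` function and the one-ID-per-line input format are assumptions, and the expected list contains only the IDs named in these testing notes.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Expected IDs taken from the testing notes: ND4L and ND1 to ND6.
my @EXPECTED = ('ND4L', map("ND$_", 1 .. 6));

# Return the expected IDs that do not appear in the given list of stable_ids.
sub missing_mt_ids {
  my @seen_ids = @_;
  my %seen = map { $_ => 1 } @seen_ids;
  return grep { !$seen{$_} } @EXPECTED;
}

# When files are given as arguments, read one stable_id per line and report.
if (@ARGV) {
  chomp(my @ids = <>);
  my @missing = missing_mt_ids(@ids);
  print @missing
    ? "Missing: @missing\n"
    : "All expected MT stable_ids found\n";
}
```

For example, the transcript IDs extracted from the VEP results can be written to a file, one per line, and passed to the script as its only argument.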
Fixes #1695