Open smoe opened 1 year ago
Hi @smoe, the check for consistency between annotations is done here: https://github.com/BIMSBbioinfo/pigx_rnaseq/blob/master/scripts/validate_input_annotation.R
Could you maybe send me the links to the annotation files so that I can check this? There might be a bug in my validation script too.
An inconsistency doesn't kill the processes, though; it should only print a warning.
Hi @borauyar, the GTF is from http://ftp.ensembl.org/pub/release-107/gtf/oryctolagus_cuniculus/ and the cDNA is from http://ftp.ensembl.org/pub/release-107/fasta/oryctolagus_cuniculus/cdna/
From the validation script, the
message(date(), " Checking annotation files for potential issues")
does not appear in the (very short) logs shown above, so the check was likely not performed because all outputs already existed. I would have expected the settings to be validated on every run, even when my runs have already completed. Can I somehow execute that check directly?
Either way, the folks at https://github.com/COMBINE-lab/salmon should address this. They have a version 1.9.0 out (the regular Guix pigx-rnaseq has 1.6.0). Is there an easy way to check if salmon has changed its behaviour?
Oh, I see. If you had run this from scratch, you should have received the warning. But if you run it after all the outputs already exist, the check is not executed. The validation script assumes that it is the first thing that runs, so that the pipeline fails as early as possible (although for this problem the pipeline wouldn't fail).
I think the problem is not with Salmon. It is up to the annotation database to have consistent nomenclature/ids between annotation files. I am sure there must be a reason why Ensembl provides such annotations.
Even if this specific issue is addressed for Ensembl, another annotation source could have a different kind of inconsistency. So I think this should be fixed upstream.
The warning likely scrolled past too quickly for me to notice. So we agree that we would prefer to see this fixed (by whoever) in salmon. I'll update the package locally and see how it goes.
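For anyone who wants to check their own annotation pair by hand, here is a minimal sketch of such a consistency check. It is not the logic of `validate_input_annotation.R`; the function names are made up for illustration, the example records are invented, and it assumes Ensembl-style FASTA headers (ID as the first token) and GTF attributes (`transcript_id "..."`).

```python
import re

def fasta_ids(lines):
    """Collect the first whitespace-delimited token of each FASTA header."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids

def gtf_transcript_ids(lines):
    """Collect transcript_id values from the GTF attribute column."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        match = re.search(r'transcript_id "([^"]+)"', line)
        if match:
            ids.add(match.group(1))
    return ids

def strip_version(identifier):
    """Drop a trailing '.<digits>' version suffix, if present."""
    return re.sub(r"\.\d+$", "", identifier)

def overlap_report(fasta_lines, gtf_lines):
    """Count shared IDs, both as-is and with version suffixes removed."""
    fa = fasta_ids(fasta_lines)
    gtf = gtf_transcript_ids(gtf_lines)
    fa_bare = {strip_version(i) for i in fa}
    gtf_bare = {strip_version(i) for i in gtf}
    return {"exact": len(fa & gtf), "ignoring_version": len(fa_bare & gtf_bare)}

# Invented records that only mimic the Ensembl layout (assumption).
fasta = [">ENSOCUT00000000001.3 cdna gene:ENSOCUG00000000001.2", "ACGT"]
gtf = ['1\tensembl\ttranscript\t1\t100\t.\t+\t.\t'
       'gene_id "ENSOCUG00000000001"; gene_version "2"; '
       'transcript_id "ENSOCUT00000000001"; transcript_version "3";']
print(overlap_report(fasta, gtf))
```

If `exact` is zero while `ignoring_version` is not, the two files most likely differ only in the version suffix.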
Upstream is aware of that problem: https://github.com/COMBINE-lab/salmon/issues/598
Hello, I read through https://github.com/BIMSBbioinfo/pigx_rnaseq/issues/35, but even though I am using your latest Guix version, the missing gene names still affect me:
The .fa file's gene identifier indeed has that version suffix that does not appear in the .gtf file:
with transcript_version and gene_version apparently providing the respective info, but when starting pigx-rnaseq, there is no warning:
The hisat2 mapping apparently works just fine. Can't we somehow fix this by communicating with the salmon folks or whoever should be addressing this?
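To illustrate the mismatch described in this post: the GTF carries the version in separate `transcript_version` / `gene_version` attributes, while the FASTA header appends it to the ID. The sketch below rebuilds the versioned form from a GTF attribute string. The helper names and the example attribute string are hypothetical, and this is not something pigx-rnaseq or salmon is known to do.

```python
import re

# Matches GTF attribute pairs of the form: key "value";
ATTRIBUTE = re.compile(r'(\S+) "([^"]*)";')

def parse_attributes(column):
    """Turn a GTF attribute column into a dict of key -> value."""
    return dict(ATTRIBUTE.findall(column))

def versioned_id(attrs, kind):
    """Join '<kind>_id' and '<kind>_version' as 'ID.version'.

    Returns the bare ID when no version attribute exists, and None
    when the ID itself is missing.
    """
    base = attrs.get(f"{kind}_id")
    if base is None:
        return None
    version = attrs.get(f"{kind}_version")
    return f"{base}.{version}" if version else base

# Invented attribute string in Ensembl-like layout (assumption).
attrs = parse_attributes(
    'gene_id "ENSOCUG00000000001"; gene_version "2"; '
    'transcript_id "ENSOCUT00000000001"; transcript_version "3";'
)
print(versioned_id(attrs, "transcript"))
print(versioned_id(attrs, "gene"))
```

The resulting strings have the same shape as the versioned identifiers in the FASTA headers, so they could be compared directly.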