Closed · mdshw5 closed this issue 7 years ago

I'm running into an issue with Salmon 0.8.2 where I've prepared two FASTA files for indexing:

The above is just an example showing that, while the files contain different transcripts, some transcripts are shared in common between the two.

Now, I index these files, passing the options: `--type quasi --perfectHash`

After indexing, one of the indices has the transcript and the other does not:

The transcripts that are dropped do not seem strange in any way (no excessive polyA and normal length). Is it expected behavior for salmon to drop transcripts during indexing?

Thanks!
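For concreteness, a minimal sketch of indexing calls with these options, wrapped from Python; the FASTA names, index paths, and loop are placeholders rather than the reporter's actual pipeline code:

```python
# Sketch only: invoking `salmon index` with the options from the report.
# The FASTA and index names below are hypothetical placeholders.
from subprocess import run

for fasta, index in [("transcripts_a.fa", "index_a"),
                     ("transcripts_b.fa", "index_b")]:
    run(["salmon", "index",
         "-t", fasta,          # transcript FASTA to index
         "-i", index,          # output index directory
         "--type", "quasi",    # quasi-mapping index
         "--perfectHash"],     # build a perfect hash for the k-mer table
        check=True)            # raise if salmon exits non-zero
```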
Hi @mdshw5,
Nope; absolutely not expected behavior (if the transcripts are of sufficient length); thanks for reporting this! Does this only happen for you under the perfectHash option? I'd be happy to take a look if there is an easy way to get the same base txome.
Hi @mdshw5,
I'm still interested in trying to figure out what might be going on here; any updates?
No updates right now. Consider this issue not reproduced yet, as I haven't had time to dig into the details. Hopefully it's an issue on my end, but expect an update in the next couple of days.
Still looking fishy ;P (pun intended)?
How did you know I just started looking into this again? :)
Hey Rob. It looks like this was an error in the way I was calling `salmon index`. I've wrapped salmon in a Python-based pipeline where I manage creation of index files using configuration files. To call `salmon index` I was previously iterating over standard error, capturing your stderr and logging it after reformatting a bit. It looks like what was happening is: my loop over stderr returned early (apparently on an EOF-like marker), so the pipeline moved on and launched `salmon quant` before indexing had actually finished. I fixed this by doing the right thing and blocking until the process returned an exit code:
```diff
 p = Popen(cmd, stderr=PIPE)
-for line in p.stderr:
-    line = line.decode()
-    if line.endswith('\n'):
-        logging.info(line.rstrip())
-    else:
-        logging.info(line)
+_, err = p.communicate()
+logging.info(err)
```
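Put together as a self-contained sketch (the command list and error handling are illustrative, not the pipeline's exact code), the blocking version might look like:

```python
# Sketch of the fix: block until `salmon index` exits, then log its stderr.
# `cmd` is illustrative; the real pipeline builds it from configuration files.
import logging
from subprocess import Popen, PIPE

logging.basicConfig(level=logging.INFO)

cmd = ["salmon", "index", "-t", "transcripts.fa", "-i", "index",
       "--type", "quasi", "--perfectHash"]

p = Popen(cmd, stderr=PIPE)
_, err = p.communicate()          # blocks until the process exits
for line in err.decode().splitlines():
    logging.info(line)            # log stderr only after salmon has finished
if p.returncode != 0:
    raise RuntimeError(f"salmon index failed with exit code {p.returncode}")
```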
Hey Matt,
First, thanks for the detailed analysis! Second, phewww --- I looked for a while in the indexer and didn't see anything that could have caused lost transcripts, so I'm glad that's not the case. It sounds like you had to go down a bit of a rabbit hole to figure this out. Anyway, I'll take a look at where Salmon might be producing an EOF marker on stderr anyway (I'd like to avoid that behavior if I'm indeed doing that). Thanks again for reporting back on this! I'll close the issue for now since it seems resolved.
Yeah, this is definitely not your issue. In fact, I just figured out that my explanation above was incomplete. You don't need to investigate anything on your end. I simply didn't flush the entire contents of a FASTA file to disk before calling `salmon index`. In the course of tracking down the issue I fixed some of my code indentation, bringing some of my code into a more global scope, so that the `with` context manager I was using to hold the FASTA file open went out of scope, flushing my final writes to disk before the indexing call. Sigh...
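As an illustration of that pitfall (file names and the `records` input are hypothetical): calling salmon inside the `with` block can read a truncated FASTA because the final buffered writes have not been flushed, while moving the call after the block guarantees the file is closed first:

```python
# Illustration of the flush pitfall; "transcripts.fa" and `records` are hypothetical.
from subprocess import run

def write_and_index(records, path="transcripts.fa"):
    with open(path, "w") as fa:
        for name, seq in records:
            fa.write(f">{name}\n{seq}\n")
        # BUG: indexing here would read a partially flushed file:
        # run(["salmon", "index", "-t", path, "-i", "index"], check=True)
    # Correct: the `with` block has closed (and flushed) the file by now.
    run(["salmon", "index", "-t", path, "-i", "index",
         "--type", "quasi", "--perfectHash"], check=True)
```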
I know it is an old post, but I just found it because I was googling the rabbit hole I have been diving into. Could it be because the GENCODE annotation contains 160 paralogs annotated as PAR in the tag column? An example is:
```
      seqnames              ranges strand |                  gene_id            transcript_id         tag
         <Rle>           <IRanges>  <Rle> |              <character>              <character> <character>
  [1]     chrX 155997581-156010608      + |       ENSG00000124334.17       ENST00000244174.10        CCDS
  [2]     chrY   57184101-57197128      + | ENSG00000124334.17_PAR_Y ENST00000244174.10_PAR_Y         PAR
```
When I look in Ensembl, only the X-chromosome version exists.
Specifically, it is "annotation in the pseudo-autosomal region, which is duplicated between chromosomes X and Y," according to this.
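If these PAR_Y duplicates are the problem, one workaround is to drop them from the transcript FASTA before indexing. A minimal sketch, assuming the GENCODE convention that duplicated records carry a `_PAR_Y` suffix in the first `|`-delimited field of the header (as in `ENST00000244174.10_PAR_Y`); file names are hypothetical:

```python
# Sketch: filter GENCODE PAR_Y duplicates out of a transcript FASTA.
# Assumes "_PAR_Y" appears at the end of the first |-delimited header field.
def strip_par_y(infile="gencode.fa", outfile="gencode.noPARY.fa"):
    keep = True
    with open(infile) as src, open(outfile, "w") as dst:
        for line in src:
            if line.startswith(">"):
                transcript_id = line[1:].split("|", 1)[0].strip()
                keep = not transcript_id.endswith("_PAR_Y")
            if keep:          # also skips sequence lines of dropped records
                dst.write(line)

strip_par_y()
```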
@kvittingseerup I'm glad that I found your comment. Thanks for providing the GENCODE explanation.