Transcripts dropped while indexing

COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment

https://combine-lab.github.io/salmon

GNU General Public License v3.0

780 stars 165 forks source link

Transcripts dropped while indexing #132

Closed mdshw5 closed 7 years ago

mdshw5 commented 7 years ago

I'm running into an issue with Salmon 0.8.2 where I've prepared a two FASTA files for indexing:

grep -A1 ENST00000244174.10_PAR_Y transcriptsA.fasta                                       
>ENST00000244174.10_PAR_Y
AGCAGCTCTGTAATGCGCTTGTGGTTTCAGATGTGGGCGGCCTGT
...
grep -A1 ENST00000244174.10_PAR_Y transcriptsB.fasta                                       
>ENST00000244174.10_PAR_Y
AGCAGCTCTGTAATGCGCTTGTGGTTTCAGATGTGGGCGGCCTGT
...

The above is just an example showing that, while the files contain different transcripts, there are transcripts that are shared in common between the two.

Now, I index these files, passing the options: --type quasi --perfectHash

After indexing, one of the indices has the transcript and the other does not:

grep ENST00000244174.10_PAR_Y transcriptsA/salmon/txpInfo.bin 
Binary file transcriptsA/salmon/txpInfo.bin matches

grep ENST00000244174.10_PAR_Y transcriptsB/salmon/txpInfo.bin

The transcripts that are dropped do not seem strange in any way (no excessive polyA and normal length). Is it expected behavior for salmon to drop transcripts during indexing?

Thanks!

rob-p commented 7 years ago

Hi @mdshw5,

Nope; absolutely not expected behavior (if the transcripts are of sufficient length); thanks for reporting this! Does this only happen for you under the perfectHash option? I'd be happy to take a look if there is an easy way to get the same base txome.

rob-p commented 7 years ago

Hi @mdshw5,

I'm still interested in trying to figure out what might be going on here; any updates?

mdshw5 commented 7 years ago

No updates right now. Consider this issue not reproduced yet, as I haven't had time to dig into the details. Hopefully it's an issue on my end, but expect an update in the next couple of days.

rob-p commented 7 years ago

Still looking fishy ;P (pun intended)?

mdshw5 commented 7 years ago

How did you know I just starting looking in to this again? :)

mdshw5 commented 7 years ago

Hey Rob. It looks like this was an error in the way I was calling salmon index. I've wrapped salmon in a python based pipeline where I manage creation of index files using configuration files. To call salmon index I was previously iterating on standard error, capturing your err and logging it after reformatting a bit. It looks like what was happening is:

I opened a subprocess and executed salmon
Salmon worked properly
Salmon stopped producing output on stderr (and sent an EOF marker?) and so my script exited
- killing salmon prematurely
- truncating the salmon index (In a way that salmon found perfectly acceptable during salmon quant
- frustrating me quite a bit

I fixed this by doing the right thing and blocking for the process to return an exit code:

                 p = Popen(cmd, stderr=PIPE)
-                for line in p.stderr:
-                    line = line.decode()
-                    if line.endswith('\n'):
-                        logging.info(line.rstrip())
-                    else:
-                        logging.info(line)
+                _, err = p.communicate()
+                logging.info(err)

rob-p commented 7 years ago

Hey Matt,

First, thanks for the detailed analysis! Second, phewww --- I looked for a while in the indexer and didn't see anything that could have caused lost transcripts, so I'm glad that's not the case. It sounds like you had to go down a bit of a rabbit hole to figure this out. Anyway, I'll take a look at where Salmon might be producing an EOF marker on stderr anyway (I'd like to avoid that behavior if I'm indeed doing that). Thanks again for reporting back on this! I'll close the issue for now since it seems resolved.

mdshw5 commented 7 years ago

Yeah, this is definitely not your issue. In fact, I just figured out that my explanation above was incomplete. You don't need to investigate anything on your end. I simply didn't flush the entire contents of a FASTA file to disk before calling salmon index. In the course of tracking down the issue I fixed some of my code indentation, bringing some of my code into a more global scope, where the with context handler I was using to hold the FASTA file open went out of scope, flushing my final writes to disk. Sigh...

kvittingseerup commented 5 years ago

I know it is a old post but I just found it because I was googling the rabbit hole I have been diving into. Could it be because GENCODE annotation contains 160 paralogs annotated (PAR in the tag column)? An example is:

      seqnames              ranges strand |                  gene_id            transcript_id         tag
         <Rle>           <IRanges>  <Rle> |              <character>              <character> <character>
  [1]     chrX 155997581-156010608      + |       ENSG00000124334.17       ENST00000244174.10        CCDS
  [2]     chrY   57184101-57197128      + | ENSG00000124334.17_PAR_Y ENST00000244174.10_PAR_Y         PAR

When I look in ensemble only the X-chromosome version exists.

Specifically it is "annotation in the pseudo-autosomal region, which is duplicated between chromosomes X and Y. accoriding to this."

MohamedRefaat92 commented 5 years ago

@kvittingseerup I'm glad that I found your comment. Thanks for providing the GENCODE explanation.