COMBINE-lab / salmon

🐟 🍣 🍱 Highly-accurate & wicked fast transcript-level quantification from RNA-seq reads using selective alignment
https://combine-lab.github.io/salmon
GNU General Public License v3.0
776 stars 164 forks source link

Multiple questions and seeking advice #401

Open PlantDr430 opened 5 years ago

PlantDr430 commented 5 years ago

Hello,

I have some questions about functions within the software and mainly I am seeking some advice on parameter setting as I believe the developers might have a better idea of how to tackle this problem I am working with.

To start, I am helping my adviser with this as a side project so I have limited knowledge on the topic and I also have limited knowledge on RNAseq as I mostly deal with full genomes. Lastly, I also do know that this research question is very narrow and I believe the concept can be done by Salmon, but this is where I would need your feedback.

The idea is to look for the number of reads that map to two versions of a gene. The gene is split into two version by exon skipping and is not alternatively spliced. I also want to make this process fast as the idea would be to look for differences in the proportions of gene versions that are created based on a large databases of RNAseq data (easily 200+ different RNAseq experiments). So to make it quick I am only passing two transcripts to Salmon (T - 1, and T - 2) for version 1 and 2 of the transcript, where version 2 has the 2nd exon (of 4 total exons) removed.

Now I know Salmon was created to map reads to a large number of transcripts across the whole genome, but I believe it still should be possible to narrow down the view to only 1 gene with 2 versions. I believe I just need to set the parameters right, but I also want to set the parameters in a general way so that my script can work across different species with different input RNAseq data. The other problem is that we currently do not have an idea of what proportion of these two versions of the gene should actually exist in the RNAseq data I have (which we didn't perform but just grabbed a random sample from GenBank to test with). My adviser wants to first try and test it computationally first and then verify it in the lab (which is somewhat backwards in my mind, as it's really just a shot in the dark and from my preliminary analysis of Salmon, different parameters can drastically change the proportions of the two versions).

As you can see below, I have tried some parameter settings that I thought would be helpful (particularly --quasiCoverage). But again I could be wrong and would like to know your opinions in the matter.

These runs were all performed with this 'default' run:

salmon quant -i index -l A -1 reads_1.fq -2 reads_2.fq --validateMappings -p 20 --numPreAuxModelSamples 250 --numAuxModelSamples 1000 -o output 

I changes the AuxModelSamples to low values as I was generally only mapping 6000 reads to the two transcripts in total, so I didn't think they were working at the default settings.

Below is also a small chart of some of my runs (which included quasi-mapping and selective-alignment), but what you can get is that there is a large variance between parameters. Particularly --seqBias showed a dramatic drop in predicted T - 2 transcripts.

image

Any suggestions onto parameters settings to help me with this narrow question?

Now onto some questions regarding some outputs.

  1. I run into a segmentation fault (core dumped) when I try to run the experimental --posBias. I am new to RNAseq, but I thought this might help with this particular RNAseq set as it was a PolyA tail selection and when mapping to the full genome there is an observed heavy mapping to exon 1 with decreased mapping over exon 2, 3 and 4.

  2. During my runs I have been using -l A as I will be using this method in my script so that Salmon can just select the best libtype based on the given sequence to help make my script less complex. However, I found something strange when I used -l A on my test sequence. I got a warning of >1% strand bias on the command line and in the lib_format_counts.json I see that Salmon selected "IU" which I believe to be the correct libtype as well, but got this:

    {
    "read_files": [
        "/data/wyka/vamsi/SRR1174205_1_paired.fastq",
        "/data/wyka/vamsi/SRR1174205_2_paired.fastq"
    ],
    "expected_format": "IU",
    "compatible_fragment_ratio": 1.0,
    "num_compatible_fragments": 5549,
    "num_assigned_fragments": 5549,
    "num_frags_with_concordant_consistent_mappings": 4190,
    "num_frags_with_inconsistent_or_orphan_mappings": 1359,
    "strand_mapping_bias": 0.45441527446300719,
    "MSF": 0,
    "OSF": 0,
    "ISF": 2286,
    "MSR": 0,
    "OSR": 0,
    "ISR": 1904,
    "SF": 547,
    "SR": 812,
    "MU": 0,
    "OU": 0,
    "IU": 0,
    "U": 0
    }

    When my strand mapping was 0.45 (which to be honest I am not sure how bad that actually is), but I noticed that my reads were mapped with the ISF, ISR, SF, and SR libtypes. Is that normal?

PlantDr430 commented 5 years ago

Hello,

I've actually been thinking of a different method that would require very stringent mapping. By providing transcripts of only exon 1 & 2, exon 2 & 3, and exon 1 & 3 I could get a better idea of the number of reads that skip exon 2 all together. Also, by averaging the read counts that map to the junctions of exon 1 & 2 and exon 2 & 3, I can help eliminate polyA tail bias that is heavily positioned towards exon 1 and would also allow me to get a more accurate prediction of the two gene versions since 1 read mapped to exon 1 & 2 and 1 read mapped to exon 2 & 3 would essentially tell me twice that the gene is there while a read mapped to exon 1 & 3 would only tell me once that the gene is there. However, doing so would force me to bring AuxSampleNumber down to very low numbers such as 10 - 100 as using stringent coverage parameters drastically reduces my reads mapped.

I do wonder though how these low AUX numbers might affect your model development and algorithm. Any input into the aspect of low AUX numbers?

PlantDr430 commented 5 years ago

Hello,

Just checking in to see if you have any opinions of if setting really low AuxSampleNumber would have a negative affect on the overall algorithm and model development.

Thanks