bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
985 stars 353 forks source link

update yaml templates #1575

Closed saboswell closed 7 years ago

saboswell commented 7 years ago

Can you possibly add a template for a PairedEnd RNAseq experiment?

And a single end alignment using the genome file that has the ERCC spike ins appended?

Also would it be possible toupdate the current example to be more current (with star alignment)?

Sarah

roryk commented 7 years ago

Hi Sarah,

There isn't anything special that has to get done for a paired end experiment, the same template works for both. bcbio detects if a sample is paired by checking if there are two or one FASTQ files for the sample.

To use ERCC spike ins, you'd have to create a separate genome file with the spike ins appended to them using the bcbio_setup_genome.py script with the --ercc option turned on. Then you could use that ERCC-appended genome to align to.

The example doesn't have STAR as the default aligner because it takes a lot of memory and not everyone has machines with enough memory to run it.

Hope that helps some!

roryk commented 7 years ago

I could set up an ERCC spiked genome on Orchestra if you let me know which one you want to add spike-ins to.

roryk commented 7 years ago

An alternative way to handle ERCC spike ins without creating a new genome, if it is just a one off thing is to run the fast RNA-seq pipeline with a custom transcriptome file with the ERCC spike in sequenced added to the end, and then use those estimates downstream.

saboswell commented 7 years ago

I have run pipelines with ERCC added on before. It would be great to just add it on to the current mouse and human genomes. If you don't have ERCC then it shouldn't really matter. I thought Lorena did something like this already.

roryk commented 7 years ago

Great, maybe we're set up to do that already. I'll take a look in a bit, straightening out some account issues with HMS now.

roryk commented 7 years ago

Lorena is adding some functionality to bcbio for another project to do this type of spike in quantification on the fly, so we don't have to regenerate the genomes. It looks like we're good to go already with hg19 on Orchestra, you can use the spike ins with hg19-ercc as the genome.

roryk commented 7 years ago

Hi Sarah,

@lpantano added an option to quantitate spike ins by passing a FASTA file of the spikein sequences via the spikein_fasta option, so we no longer have to generate new genomes to use spikeins. Thanks for the suggestion!

saboswell commented 7 years ago

Cool! That is really great! Will hg19-ercc still work? Does this run as a default or do we need to specify it in the yaml? and where to I add it to the yaml?

I especially want to clarify as I've put together a quick how - to - run BCBIO for RNAseq for folks over here and want to be sure I'm giving them the correct info on the spike ins since we always use them.

Thanks again everyone!

roryk commented 7 years ago

Hi Sarah,

Sure-- is this for single-cell or for bulk sequencing? hg19-ercc will still work, but it is unnecessary. You can quantitate spike ins by passing a FASTA file of the sequences to quantitate in the configuration via the spikein_fasta option, under the algorithm section. There will be a separate file spikein.sf in the upload directory that has a tidy format version of the spike in counts. @lpantano do you think you could bump the Orchestra install up to the development version so these changes get added?

stephenturner commented 7 years ago

I'm trying to run the fast rna-seq analysis, but can't figure out how to do so. I've tried analysis: fast_rnaseq and analysis: fast RNA-seq, but I get an error each time:

Cannot determine which type of analysis to run, set in the run_info under details.

What should my yaml look like to run the salmon only analysis?

Related to that, can different pieces of the regular RNA-seq pipeline be turned on and off?

roryk commented 7 years ago

Hi Stephen,

Hope all is well. analysis: fastrna-seq will do the trick. You can also pass a different set of transcripts to quantitate against if you pass in the transcriptome_fasta option.

You can turn variant calling, transcript assembly and fusion calling on and off; we keep the alignment plus Sailfish quantitation on no matter what because for some of the QC metrics you need the alignments, and we think that running a RNA-seq experiment without the QC metrics only gives you part of the important picture. The most recent version of Salmon can spit out the actual quasialignments so we are planning on adding the QC stuff to the fastRNA-seq pipeline using those so we can get the best of both worlds.

stephenturner commented 7 years ago

Thanks. Been using bcbio-nextgen for a week and it's great.

stephenturner commented 7 years ago

i'm running into problems with a template that looks like this:

details:
  - analysis: fastrna-seq
    genome_build: mm10
    algorithm:
upload:
  dir: ../final

since there are no algorithm parameters to set (?) not sure what I should do here.

/home/sdt5z/bcbio/local/share/bcbio/anaconda/lib/python2.7/site-packages/bcbio/pipeline/run_info.py in _add_algorithm_defaults(algorithm=None)
    792                 "validate": None,
    793                 "validate_regions": None}
    794     convert_to_list = set(["archive", "tools_off", "tools_on", "hetcaller", "variantcaller", "qc", "disambiguate"])
    795     convert_to_single = set(["hlacaller", "indelcaller", "validate_method"])
    796     for k, v in defaults.items():
--> 797         if k not in algorithm:
        k = 'nomap_split_targets'
        algorithm = None
    798             algorithm[k] = v
    799     for k, v in algorithm.items():
    800         if k in convert_to_list:
    801             if v and not isinstance(v, (list, tuple)):

TypeError: argument of type 'NoneType' is not iterable
roryk commented 7 years ago

Hi Stephen,

Thanks, sorry about the problem-- I pushed a fix that will handle an empty algorithm section cleanly. Practically you can get around this issue without updating by doing:

algorithm: {} instead of algorithm:.

schelhorn commented 7 years ago

@roryk, could we get counts as well as TPMs for the spike-ins? If I remember correctly it currently only generates counts. We routinely quantitate ERCC and HPV L1 from all reference strains using Sailfish by adding the spike-ins to the reference (a hack), and it would be great to have TPM for there in the bcbio standard build as well so we could leave the reference untouched. I'd be happy to share the HPV reference .fa/.gtf in return, so HPV genotyping could become a standard feature in bcbio :)

chapmanb commented 7 years ago

Sven-Eric -- we've been looking into adding HPV genotyping to bcbio and Rory reminded me that you'd offered to share the reference you use. Is this still available for sharing? We'd love to hear about how you do this and figure out how to get it in place for both DNA and RNA-seq pipelines. Thanks much.

chapmanb commented 7 years ago

Sven-Eric -- Rory pointed me at this other thread I missed with the HPV you're using (https://github.com/chapmanb/bcbio-nextgen/issues/1709#issuecomment-272859939). We discussed a bit and are currently aiming at using the viral sequence file from TCGA GDC:

https://gdc.cancer.gov/about-data/data-harmonization-and-generation/gdc-reference-files

which has HPV plus other potentially interesting viral sequences. Our plan on the variant calling side will be to map unmapped reads to this fasta and report counts in a MultiQC table. Hopefully this works with the outputs you'd need as well.

schelhorn commented 7 years ago

Hello Brad and Rory,

firstly thank you very much for approaching this issue. It's a complicated issue with oncoviruses; for a number of reasons purely focusing on DNA offers relatively low signal to noise ratios due to various viral mechanisms that result in few viral genome copies identifiable by DNA-based library prep. In addition, many viruses such as HCV do not have a DNA representation (i.e., they are RNA viruses without DNA intermediate). Last, the clinical (i.e., diagnostically relevant) definition of viral genotypes most often is based only on selected, variable sequence regions that commonly correspond to specific viral transcripts.

For these three reasons, looking on RNA instead of on DNA makes a lot of sense. Since all directly oncogenic viruses have to produce mRNA (regardless of their replication mode), there will be many more copies of viral mRNA of specific transcripts than of viral genomes. Obviously, this will also capture RNA viruses, even for poly-A enriched library preps that may not show th viral RNA-genomes but just their mRNA transcripts. Last, RNA-based analysis allows focused investigation of exactly the viral regions that are used for clinical genotyping (i.e. transcripts encoding for capsid and envelope proteins that are variable to allow for viral immune evasion but that at the same time are transcribed at high copy numbers in directly oncogenic viruses). As a bonus, methods for quantifying transcripts (i.e., Sailfish and Salmon) will implicitly deal with the mapping uncertainty of transcripts originating from closely homologous viral genotypes; this important feature one doesn't get from variant calling workflows at all. Relating this that, TPM values are great as well since they are directly comparable between sequencing libraries.

The only benefit from purely DNA-based analysis is detection of fully silenced viral DNA genomes (i.e., full latency), and the discovery of viral integration sites (which is notoriously difficult for SV callers, however). This will require sufficient sequencing depth for detecting low viral genome copy numbers, though, and is also less of a smoking gun than finding viral transcripts.

So I guess what I am trying to say is that I am happy to have the proposed approach included in bcbio's DNA-based variant calling pipeline, but that I personally see more utility in RNA-Seq quantitation for the reasons mentioned above. If any viral references are used for RNA-Seq in conjunction with Sailfish/Salmon (which I'd recommend), transcripts specific for clinical genotyping (such as the L1 regions for HPV that I provided) are sufficient and directly interpretable. Whole viral genomes should only be used if the quantitation method that is employed receives a proper viral genome annotation so that it knows that different parts of the genome are expected to generate wildly (3 orders of magnitude) different RNA copy numbers.

It is for these reasons that the GDC references are not suitable for RNA-based analysis (and also not completely for DNA-based variant calling, since there is at least one RNA virus on the list that has no DNA intermediate, which makes me think that the GDC people may not have thought too deeply about the matter ;)).

Last, regarding the DNA-Seq read counts in the MultiQC table, these will depend on the sequencing depth and the relative length of the viral genomes (which may differ a lot even between viral genotypes of the same genus), and will have to differentiate between primary and secondary alignments (high sequence similarity between genotypes). How would you like to approach these issues?

ohofmann commented 7 years ago

Sven-Eric, points well taken! I've heard similar things from people who are trying to pinpoint HPV insertion sites, and we've had similar experiences with HIV integration site detection.

That said, I don't see us routinely do RNA-Seq for patient data just yet, and we don't necessarily need the insertion site details at this point. I've bee thinking more along the lines of a Kraken (or similar) run against a viral genome database or some sketch approach ala http://ivory.idyll.org/blog/2017-sourmash-sra-microbial-wgs.html . Do we have test genomes with/without known HPV insertions?

chapmanb commented 7 years ago

Sven-Eric; Thanks for the detailed overview. I appreciate all the thought you've given this. It sounds like our best approach is to take a rough and ready first stab with counting making use of Kraken or bwa and then think more deeply about how to incorporate into RNA-seq and improve the DNA-seq usage. My first pass wasn't going to try getting into insertion sites, which Rory has done for projects internally and I know is a pretty big research project. Practically I only wanted to have counts of viral sequences present and am perhaps too optimistic that this will be useful without further refinement, but can at least put that in place as a starting point.

schelhorn commented 7 years ago

All good with me as long as the option for specifying additional transcripts for RNA-Seq quantitation remains. We often do RNA-Seq in clinical settings so that's our preferred way to do this. Check out ViralFusionSeq, VirusFinder 2, Vy-PER, and the VISA web service for references to simulated and experimental DNA-Seq data sets containing proviral integration sites. You'll only find clonal insertions, though ;) The minhash approach seems to be a good replacement for kraken.

schelhorn commented 7 years ago

Also, @ohofmann, this preprint just popped up: http://biorxiv.org/content/early/2017/02/01/105080

ohofmann commented 7 years ago

Ah, interesting. Thought this would never get published -- it's the SV caller the PCAWG SV group has been developing / using for the pancancer analysis. I did not know they also applied it to viral integration questions. Code is at https://github.com/walaj/svaba, might be worthwhile testing and packaging.

chapmanb commented 7 years ago

Sven-Eric and all; Thank you again for all the suggestions and thoughts on this. I implemented a first version of viral detection for DNA-seq somatic variant calling. It takes unmapped reads and uses bwa to align against viral sequences from the GDC TCGA distribution, then reports the counts in the MultiQC report. I'll do some testing on projects to evaluate how well it does at recovering useful viral counts and we can continue to iterate and improve this.

On the SV side, SvABA definitely looks worth investigating. Jeremiah does good work.