epam / fonda

Fonda is a framework which offers scalable and automatic analysis of multiple NGS sequencing data types
Apache License 2.0

SCRnaExpression-CellRangerFastq - new style #164

Closed syansanofi closed 4 years ago

syansanofi commented 4 years ago

Issue

There are two issues: first, the feature barcode syntax; second, combining the VDJ workflow with the new scRNAExpression-CellRangerFastq workflow.

Feature Barcode

Currently, FONDA uses 'old' cellranger syntax. This supports a single library per sample. The design of this syntax predates antibody capture technology (CITE-SEQ). Cellranger currently supports CITE-SEQ analysis by using 'feature barcode' through a new syntax.

Old syntax: cellranger count --id Sample1 --sample Sample1 --transcriptome /path/to/reference --fastqs /path/to/fastqs

New syntax: cellranger count --id Sample1 --transcriptome /path/to/reference --libraries sample1_libraries.csv --feature-ref feature_reference.csv

https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/feature-bc-analysis

In essence, new syntax supports multi-library runs directly whereas previously it was required to input a custom 'Martian' pipeline file to cellranger. Therefore, the new syntax requires TWO new files.
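As a minimal sketch of the difference, the new-style invocation can be assembled from the two new files instead of --sample and --fastqs. The helper name build_count_command is hypothetical and is not part of FONDA; the flags are the ones shown in the new syntax above.

```python
from typing import List


def build_count_command(sample_id: str, transcriptome: str,
                        libraries_csv: str, feature_ref_csv: str) -> List[str]:
    """Return the argument list for a new-syntax `cellranger count` run.

    Hypothetical helper for illustration only. The libraries CSV replaces
    the old --sample/--fastqs pair, and the feature reference CSV is passed
    alongside the transcriptome.
    """
    return [
        "cellranger", "count",
        "--id", sample_id,
        "--transcriptome", transcriptome,
        "--libraries", libraries_csv,
        "--feature-ref", feature_ref_csv,
    ]


if __name__ == "__main__":
    print(" ".join(build_count_command(
        "Sample1", "/path/to/reference",
        "sample1_libraries.csv", "feature_reference.csv")))
```

Returning a list rather than a single string keeps the arguments safe to hand to a process launcher without shell quoting.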

Sample_libraries.csv

This file outlines library information with the following columns:

  1. fastqs - path to directory containing bcl2fastq or cellranger mkfastq output
  2. sample - prefix of files in directory
  3. library_type - Gene Expression, Antibody Capture, Custom, CRISPR Guide Capture

Example situations for a single sample include, but are not limited to:

Example 1: 1 Gene Expression only
fastqs,sample,library_type
/path/to/fastqs,sample1,Gene Expression

Example 2: Matching Gene Expression + ADT
fastqs,sample,library_type
/path/to/fastqs,sample1,Gene Expression
/path/to/fastqs,sample1,Custom

Example 3: Non Matching Gene Expression + ADT
fastqs,sample,library_type
/path/to/fastqs,sample1,Gene Expression
/path/to/fastqs,sample1a,Custom

Example 4: 1 ADT only
fastqs,sample,library_type
/path/to/fastqs,sample1,Custom

Example 5: 2 ADT
fastqs,sample,library_type
/path/to/fastqs,sample1hto,Custom
/path/to/fastqs,sample1adt,Custom

In other words, most if not all combinations of libraries are valid, including sample groupings without gene expression where only feature barcodes were sequenced. Also, with multiple samples, Cellranger takes the --id field as the sample name. This characteristic is critical because current experimental designs often entail a one-to-many relationship for ADT to Gene Expression library pairing. For example: 10 GEX libraries matched against 1 Antibody Capture library.
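The library file for any of the situations above could be rendered with a few lines of Python. This is only a sketch: render_libraries_csv is a hypothetical name, and the set of accepted library types is taken from the values mentioned in this issue, not from a definitive list.

```python
import csv
import io
from typing import Iterable, Tuple

# Library types mentioned in this issue (assumed, not an exhaustive list).
LIBRARY_TYPES = {"Gene Expression", "Antibody Capture", "Custom",
                 "CRISPR Guide Capture"}


def render_libraries_csv(rows: Iterable[Tuple[str, str, str]]) -> str:
    """Render (fastqs, sample, library_type) rows as libraries CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["fastqs", "sample", "library_type"])
    for fastqs, sample, library_type in rows:
        if library_type not in LIBRARY_TYPES:
            raise ValueError(f"unsupported library_type: {library_type!r}")
        writer.writerow([fastqs, sample, library_type])
    return buffer.getvalue()


if __name__ == "__main__":
    # Non-matching Gene Expression + ADT prefixes (Example 3).
    print(render_libraries_csv([
        ("/path/to/fastqs", "sample1", "Gene Expression"),
        ("/path/to/fastqs", "sample1a", "Custom"),
    ]), end="")
```

Rejecting unknown library types early is a design choice of the sketch, so that a typo in the manifest fails before a long cellranger run starts.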

feature-ref.csv

This csv file has the following columns:

  1. id - feature id
  2. name - feature name (can be seen as description)
  3. read - R1 or R2, which read contains the feature-barcode in the fastq?
  4. pattern - regex pattern for detecting reads (related to Total-Seq library type)
  5. sequence - nucleotide sequence for feature barcode
  6. feature_type - type of feature, hardcoded as of cellranger v3.1.0 (Antibody Capture, Custom, CRISPR Guide Capture)
  7. target_gene_id - CRISPR Guide target ID
  8. target_gene_name - CRISPR Guide target name

This file can be seen as an additional reference file to the transcriptome.
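Because this file is maintained by hand, a light sanity check before launching a run may be useful. The sketch below checks only what is listed above: the column names, the R1/R2 read value, and the hardcoded feature types. The function name check_feature_ref is hypothetical, and treating the two target_gene columns as optional is an assumption.

```python
import csv
import io
from typing import List

# First six columns are assumed required; target_gene_* assumed optional.
REQUIRED_COLUMNS = ["id", "name", "read", "pattern", "sequence", "feature_type"]
FEATURE_TYPES = {"Antibody Capture", "Custom", "CRISPR Guide Capture"}


def check_feature_ref(text: str) -> List[str]:
    """Return a list of problems found in feature reference CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in header]
    if missing:
        return ["missing columns: " + ", ".join(missing)]
    problems = []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        if row["read"] not in ("R1", "R2"):
            problems.append(f"line {line_number}: read must be R1 or R2")
        if row["feature_type"] not in FEATURE_TYPES:
            problems.append(f"line {line_number}: unknown feature_type")
    return problems


if __name__ == "__main__":
    # The id, name, pattern and sequence below are made-up placeholders.
    example = (
        "id,name,read,pattern,sequence,feature_type\n"
        "ab1,placeholder antibody,R2,^(BC),ACGTACGTACGTACG,Antibody Capture\n"
    )
    print(check_feature_ref(example))
```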

VDJ

The multi-library design is not limited to cellranger count but also includes cellranger vdj. Experiments often include VDJ libraries as well. Currently the workflows are separate, following the design mentality of 10X Genomics (the advice is to run cellranger twice, once for count and once for vdj). This is still the general idea, but it is necessary to bring VDJ into a unified COUNT+VDJ workflow in FONDA due to processing time. VDJ and COUNT should be parallelized, otherwise processing time would be unacceptable for daily use.
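To illustrate the parallelization point, here is a minimal sketch that launches two independent commands at once and collects their exit codes. It is not how FONDA schedules jobs; run_in_parallel is a hypothetical helper, and in practice the two commands would be the count and vdj invocations.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def run_in_parallel(commands: Dict[str, List[str]]) -> Dict[str, int]:
    """Run each named command concurrently and return its exit code."""
    # Threads are enough here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        futures = {name: pool.submit(subprocess.run, argv)
                   for name, argv in commands.items()}
        return {name: future.result().returncode
                for name, future in futures.items()}


if __name__ == "__main__":
    # Stand-in commands; real use would pass the cellranger argument lists.
    stand_in = [sys.executable, "-c", "print('done')"]
    print(run_in_parallel({"count": stand_in, "vdj": stand_in}))
```

With this layout the total wall time is roughly that of the slower of the two tools instead of their sum.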

Proposal

Changes

1. Dynamic generation of sample_libraries.csv using sample_manifest.txt table to add parameters to the fastq sample entity for SCRnaExpression-CellRangerFastq

Following the example of DNAcaptureVar_Fastq design, we can add two columns (including or not including tumor / case columns) to the sample manifest table.

  1. library_type: this would specify to FONDA the library_type column of the library
  2. master: this would specify to FONDA the overall sample name. It is debatable whether this should be a link to the master shortName or an altogether standalone name. For example: sample1 (None), sample1adt (sample1), sample1hto (sample1) OR sample1gex (sample1), sample1adt (sample1), sample1hto (sample1)
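The two proposed columns could drive the grouping roughly as follows. This is a sketch under assumptions: the row keys shortName, fastq_dir, library_type and master are hypothetical stand-ins for manifest columns, and a row without a master value is treated as its own master, which covers the first naming option above.

```python
from typing import Dict, List, Optional, Tuple

Library = Tuple[str, str, str]  # (fastqs, sample, library_type)


def group_by_master(rows: List[Dict[str, Optional[str]]]) -> Dict[str, List[Library]]:
    """Group manifest rows into one libraries list per master sample."""
    groups: Dict[str, List[Library]] = {}
    for row in rows:
        # A missing master means the row names the master sample itself.
        master = row.get("master") or row["shortName"]
        groups.setdefault(master, []).append(
            (row["fastq_dir"], row["shortName"], row["library_type"]))
    return groups


if __name__ == "__main__":
    manifest = [
        {"shortName": "sample1", "fastq_dir": "/path/to/fastqs",
         "library_type": "Gene Expression", "master": None},
        {"shortName": "sample1adt", "fastq_dir": "/path/to/fastqs",
         "library_type": "Custom", "master": "sample1"},
        {"shortName": "sample1hto", "fastq_dir": "/path/to/fastqs",
         "library_type": "Custom", "master": "sample1"},
    ]
    for master, libraries in group_by_master(manifest).items():
        print(master, libraries)
```

Each key of the result would become one --id value, and each list one sample_libraries.csv.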

2. Add feature-ref field to global_config database section.

Although in current testing we have not seen issues when using a comprehensive feature-ref.csv file that includes all possible library types and barcodes, it would be better to leave the management of this file outside of FONDA, since this list could undergo modification and thus create conflicts. It is convenient to think of this file in a separate fashion to the 'BED' files in DnaCaptureVar_Fastq workflow.

3. Modify syntax of count template file to match new syntax

This is self explanatory.

4. Bring VDJ and COUNT together as toolset parameters in a new workflow

Current output from SCRnaExpression-CellRangerFastq has something similar to:

We can simply add VDJ as a new tool to the toolset. When FONDA parses the sample manifest and sees vdj specified as the library type, it could execute the VDJ tool and direct the results to a new VDJ output dir alongside COUNT.
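The routing described here might look like the sketch below. The label "VDJ" for the library_type value and the helper name split_by_tool are assumptions made for illustration; the issue does not fix the exact spelling of the manifest value.

```python
from typing import Dict, List, Tuple

Library = Tuple[str, str, str]  # (fastqs, sample, library_type)


def split_by_tool(libraries: List[Library],
                  vdj_label: str = "VDJ") -> Dict[str, List[Library]]:
    """Route libraries to the vdj tool or the count tool by library_type."""
    tools: Dict[str, List[Library]] = {"count": [], "vdj": []}
    for library in libraries:
        # Compare case-insensitively so 'vdj' and 'VDJ' are both accepted.
        is_vdj = library[2].strip().lower() == vdj_label.lower()
        tools["vdj" if is_vdj else "count"].append(library)
    # Drop tools with nothing to process so they are never launched.
    return {tool: libs for tool, libs in tools.items() if libs}


if __name__ == "__main__":
    print(split_by_tool([
        ("/path/to/fastqs", "sample1", "Gene Expression"),
        ("/path/to/fastqs", "sample1vdj", "vdj"),
    ]))
```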

See a rough sketch of the proposal. Apologies for the low quality output from MIRO.

[Image: new_fonda (1)]

kamyshova commented 4 years ago

@syansanofi Hi, Shu. Would you like to implement these changes by yourself? Or should I start implementation? If so, it would be great to get test data from you.

syansanofi commented 4 years ago

@kamyshova Hi Yulia, thank you. I will email you an example run and its configs.

kamyshova commented 4 years ago

@syansanofi Hi, Shu. I support your idea about the master column. As far as I can see, we pass the column value to the count tool as the --id argument. In addition, the master column allows us to unambiguously map the sample_manifest.txt table entity to the final samples.

syansanofi commented 4 years ago

@kamyshova Hi Yulia. Thank you for the feedback. Then I think we can proceed as agreed. For future reference, we will add master and library_type and remove sample_type and match_control.

kamyshova commented 4 years ago

@syansanofi Hi, Shu. I've prepared the pull request containing the described changes. Could you check it, please? I've used your test data for verification and found two concerns:

  1. Count tool. Currently, if forcedCells == 'NA', then --expect-cells is passed as a parameter. But --expect-cells can itself have the value 'NA', and the new cellranger version does not allow NA for --expect-cells (only a number). So I've added a check on --expect-cells: if its value is NA or it is not set, we just skip it.
  2. VDJ samples. I was unable to test new scripts for VDJ samples on the provided test data because of 'Your reference does not contain the expected files, including '/path/to/references/refdata-cellranger-mm10-3.0.0/fasta/regions.fa', or they are not readable.' error. The '/path/to/references/refdata-cellranger-mm10-3.0.0/fasta/' folder includes only 'genome.fa' and 'genome.fa.fai' used for count tool. I've added the note to the workflow description: 'Please note that if you used vdj tool GENOME folder should include regions.fa file.'
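The check from point 1 can be sketched as follows. This is not the code from the pull request; optional_cell_args is a hypothetical helper that mirrors the described behavior, where --expect-cells is only considered when no forced cell count is given, and any non-numeric value such as 'NA' is dropped.

```python
from typing import List, Optional


def _is_count(value: Optional[str]) -> bool:
    """True only for a plain non-negative integer such as '3000'."""
    return value is not None and value.strip().isdigit()


def optional_cell_args(expected_cells: Optional[str],
                       forced_cells: Optional[str]) -> List[str]:
    """Return the cell-count flags to append, skipping 'NA' or unset values."""
    if _is_count(forced_cells):
        return ["--force-cells", forced_cells.strip()]
    if _is_count(expected_cells):
        return ["--expect-cells", expected_cells.strip()]
    # Neither value is numeric: add nothing to the generated script.
    return []


if __name__ == "__main__":
    print(optional_cell_args("NA", "NA"))    # no flags added
    print(optional_cell_args("3000", "NA"))  # only --expect-cells is added
```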

What are your thoughts on these changes? Let me know, thank you!

syansanofi commented 4 years ago

Thank you for the changes Yulia.

  1. I agree. Only numbers should be input, otherwise that field should not be present in the final shell script. Expected cells and forced cells are one-off parameters that users can specify when they see a previous run of the pipeline producing unsatisfactory results. In our experience, we typically set expected-cells and not force-cells, just an FYI if you are interested.

  2. VDJ should have its own reference file directory (typically pre-downloaded from 10x). I apologize if it was not clear from the test data. For example:

--reference=/path/to/your/reference/refdata-cellranger-vdj-GRCm38-alts-ensembl-3.1.0

I think it would be best handled by adding a new database field to the global_config_scRnaExpression_CellRanger_Fastq files that would indicate this. Something like:

VDJ_GENOME=/Users/e0445210/code/github/fonda/example/global_config/global_config_scRnaExpression_CellRanger_Fastq_v1.1_human.txt

kamyshova commented 4 years ago

@syansanofi Shu,

"I think it would be best handled by adding a new database field to the global_config_scRnaExpression_CellRanger_Fastq files that would indicate this."

I support this solution. I've added the new commit with VDJ_GENOME.
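For readers following along, a minimal sketch of how a VDJ_GENOME entry might feed the vdj command is shown below. It assumes the config stores plain KEY=VALUE lines and that section headers and comments can be ignored; parse_config and build_vdj_command are hypothetical names and this is not the code from the commit.

```python
from typing import Dict, List


def parse_config(text: str) -> Dict[str, str]:
    """Collect KEY=VALUE pairs, ignoring blanks, comments and [sections]."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or line.startswith("["):
            continue
        key, separator, value = line.partition("=")
        if separator:
            values[key.strip()] = value.strip()
    return values


def build_vdj_command(sample_id: str, fastqs: str, sample: str,
                      config: Dict[str, str]) -> List[str]:
    """Build a `cellranger vdj` argument list using the VDJ reference."""
    reference = config.get("VDJ_GENOME")
    if not reference:
        raise KeyError("VDJ_GENOME must be set to use the vdj tool")
    return ["cellranger", "vdj", "--id", sample_id,
            "--reference", reference,
            "--fastqs", fastqs, "--sample", sample]


if __name__ == "__main__":
    config = parse_config(
        "GENOME=/path/to/references/refdata-cellranger-mm10-3.0.0\n"
        "VDJ_GENOME=/path/to/references/refdata-cellranger-vdj-GRCm38-alts-ensembl-3.1.0\n"
    )
    print(" ".join(build_vdj_command(
        "Sample1", "/path/to/fastqs", "sample1vdj", config)))
```

Keeping the VDJ reference in its own field means the count tool keeps using GENOME unchanged, and a missing VDJ_GENOME fails before any job is launched.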

syansanofi commented 4 years ago

I also fixed a small issue with vdj forced-cells field. See commit 16cdd9536848d4115c8f2cd27615e327d69313a9