Closed syansanofi closed 4 years ago
@syansanofi Hi, Shu. Would you like to implement these changes by yourself? Or should I start implementation? If so, it would be great to get test data from you.
@kamyshova Hi Yulia, thank you. I will email you an example run and its configs.
@syansanofi Hi, Shu. I support your idea about the master column. As far as I can see, we pass the column value to the count tool as the --id argument. In addition, the master column allows us to unambiguously map the sample_manifest.txt table entity to the final samples.
@kamyshova Hi Yulia. Thank you for the feedback. Then I think we can proceed as agreed. For future reference, we will add master and library type, and remove sample_type and match_control.
@syansanofi Hi, Shu. I've prepared the pull request containing the described changes. Could you check it, please? I've used your test data for verification and found two concerns:
- Count tool. Currently, if forcedCells == 'NA', --expect-cells is passed as a parameter. But --expect-cells can itself have the value 'NA', and the new cellranger version does not allow NA for --expect-cells (only a number). So I've added a check on --expect-cells: if it is NA or not set, we just skip it.
- VDJ samples. I was unable to test the new scripts for VDJ samples on the provided test data because of this error: 'Your reference does not contain the expected files, including '/path/to/references/refdata-cellranger-mm10-3.0.0/fasta/regions.fa', or they are not readable.' The '/path/to/references/refdata-cellranger-mm10-3.0.0/fasta/' folder includes only 'genome.fa' and 'genome.fa.fai', which are used by the count tool. I've added the note to the workflow description: 'Please note that if you used vdj tool GENOME folder should include regions.fa file.'
What are your thoughts on these changes? Let me know, thank you!
Thank you for the changes Yulia.
I agree. Only numbers should be input, otherwise that field should not be present in the final shell script. Expected cells and forced cells are one-off parameters that users can specify when they see a previous run of the pipeline producing unsatisfactory results. In our experience, we typically set expected-cells and not force-cells, just an FYI if you are interested.
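For reference, the rule agreed here (pass the flag only when the value is a number, otherwise leave it out of the final shell script) could be sketched roughly as below. This is a Python illustration only, not FONDA's actual code; the helper name `optional_cell_flags` and the assumption that values arrive as strings with 'NA' meaning "not specified" are mine.

```python
import re


def optional_cell_flags(expected_cells=None, forced_cells=None):
    """Return cellranger cell-count flags, skipping unset or non-numeric values.

    Hypothetical helper: values are assumed to arrive as strings (or None),
    with 'NA' meaning "not specified".
    """
    flags = []
    for flag, value in (("--expect-cells", expected_cells),
                        ("--force-cells", forced_cells)):
        text = "" if value is None else str(value).strip()
        # Only a plain positive integer is passed through; 'NA', '' and any
        # other non-numeric text mean the flag is omitted entirely.
        if re.fullmatch(r"[0-9]+", text) and int(text) > 0:
            flags.append(f"{flag}={int(text)}")
    return flags


print(optional_cell_flags("5000", "NA"))  # ['--expect-cells=5000']
print(optional_cell_flags("NA", None))    # []
```

The point of returning a list is that an empty result adds nothing to the command line, so neither flag ever appears with an 'NA' value.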
VDJ should have its own reference file directory (typically pre-downloaded from 10x). I apologize if it was not clear from the test data. For example:
--references=/path/to/your/reference/refdata-cellranger-vdj-GRCm38-alts-ensembl-3.1.0
I think it would be best handled by adding a new database field to the global_config_scRnaExpression_CellRanger_Fastq files that would indicate this. Something like:
VDJ_GENOME=/Users/e0445210/code/github/fonda/example/global_config/global_config_scRnaExpression_CellRanger_Fastq_v1.1_human.txt
@syansanofi Shu,
I think it would be best handled by adding a new database field to the global_config_scRnaExpression_CellRanger_Fastq files that would indicate this.
I support this solution. I've added the new commit with VDJ_GENOME.
I also fixed a small issue with vdj forced-cells field. See commit 16cdd9536848d4115c8f2cd27615e327d69313a9
Issue
There are two issues. First, feature barcode syntax. Second, combining VDJ workflow with new scRNAExpression-CellRangerFastq workflow.
Feature Barcode
Currently, FONDA uses 'old' cellranger syntax. This supports a single library per sample. The design of this syntax predates antibody capture technology (CITE-SEQ). Cellranger currently supports CITE-SEQ analysis by using 'feature barcode' through a new syntax.
Old syntax
cellranger count --id Sample1 --sample Sample1 --transcriptome /path/to/reference --fastqs /path/to/fastqs
New syntax
cellranger count --id Sample1 --transcriptome /path/to/reference --libraries sample1_libraries.csv --feature-ref feature_reference.csv
https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/feature-bc-analysis
In essence, new syntax supports multi-library runs directly whereas previously it was required to input a custom 'Martian' pipeline file to cellranger. Therefore, the new syntax requires TWO new files.
Sample_libraries.csv
This file outlines library information with the following columns:
Some example situations for a single sample can be but not limited to:
In other words, most if not all combinations of libraries are valid, including sample groupings without gene expression where only feature barcodes were sequenced. Also, when a sample has multiple libraries, Cellranger takes the --id field as the sample name. This characteristic is critical because current experimental designs often entail a one-to-many relationship between ADT and Gene Expression libraries. For example: 10 GEX libraries matched against 1 Antibody Capture library.
feature-ref.csv
This csv file has the following columns:
This file can be seen as an additional reference file to the transcriptome.
VDJ
The multi-library design is not limited to cellranger count; it also covers cellranger vdj, and experiments often include VDJ libraries as well. Currently the workflows are separate, following the design approach of 10x Genomics (which advises running cellranger twice, once for count and once for vdj). That is still the general idea, but VDJ needs to be brought into a unified COUNT+VDJ workflow in FONDA because of processing time: VDJ and COUNT should run in parallel, otherwise processing time would be unacceptable for daily use.
Proposal
Changes
1. Dynamic generation of sample_libraries.csv using sample_manifest.txt table to add parameters to the fastq sample entity for SCRnaExpression-CellRangerFastq
Following the example of DNAcaptureVar_Fastq design, we can add two columns (including or not including tumor / case columns) to the sample manifest table.
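As a rough illustration of change 1, the sketch below groups manifest rows by the master column and renders one libraries CSV per master. The output header (fastqs, sample, library_type) is the one `cellranger count --libraries` expects; everything else is an assumption: the manifest keys (`sample`, `fastq_dir`, `master`, `library_type`), the function name, and the example paths are hypothetical, and this is Python rather than FONDA's own code.

```python
import csv
import io
from collections import defaultdict


def libraries_csv_by_master(manifest_rows):
    """Group manifest rows by master and render one libraries CSV per group.

    manifest_rows: iterable of dicts with the hypothetical keys
    'sample', 'fastq_dir', 'master' and 'library_type'.
    Returns {master: csv_text}.
    """
    grouped = defaultdict(list)
    for row in manifest_rows:
        grouped[row["master"]].append(row)

    result = {}
    for master, rows in grouped.items():
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        # Header expected by `cellranger count --libraries`.
        writer.writerow(["fastqs", "sample", "library_type"])
        for row in rows:
            writer.writerow([row["fastq_dir"], row["sample"], row["library_type"]])
        result[master] = buffer.getvalue()
    return result


# Hypothetical manifest: one GEX and one ADT library sharing a master.
manifest = [
    {"sample": "S1_gex", "fastq_dir": "/path/to/fastqs/gex",
     "master": "Sample1", "library_type": "Gene Expression"},
    {"sample": "S1_adt", "fastq_dir": "/path/to/fastqs/adt",
     "master": "Sample1", "library_type": "Antibody Capture"},
]
print(libraries_csv_by_master(manifest)["Sample1"])
```

Here the master value doubles as the --id passed to cellranger count, which is what lets each manifest row be mapped back to a final sample.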
2. Add feature-ref field to global_config database section.
Although in our testing so far we have not seen issues when using a comprehensive feature-ref.csv file that includes all possible library types and barcodes, it would be better to leave the management of this file outside of FONDA, since this list could be modified and thus create conflicts. It is convenient to think of this file in a similar fashion to the 'BED' files in the DnaCaptureVar_Fastq workflow.
3. Modify syntax of count template file to match new syntax
This is self explanatory.
4. Bring VDJ and COUNT together as toolset parameters in a new workflow
Current output from SCRnaExpression-CellRangerFastq has something similar to:
We can simply add VDJ as a new tool to the toolset: when FONDA parses the sample manifest and sees vdj specified as the library type, it could execute the VDJ tool and direct results to a new VDJ output dir alongside COUNT.
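To make the dispatch idea concrete, here is a minimal Python sketch, again under stated assumptions: the manifest keys, the `vdj` library type label, the `<master>_vdj` id and the `<master>_libraries.csv` file name are all hypothetical, and FONDA itself would generate shell scripts rather than run this code. The count command follows the new syntax shown earlier in this issue; the vdj command uses cellranger vdj's --id, --reference, --fastqs and --sample options.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def plan_commands(manifest_rows, transcriptome, feature_ref, vdj_reference):
    """Return cellranger argument lists, one per independent job."""
    count_masters = []
    vdj_commands = []
    for row in manifest_rows:
        master = row["master"]
        if row["library_type"].strip().lower() == "vdj":
            # VDJ libraries get their own job and their own reference.
            vdj_commands.append([
                "cellranger", "vdj",
                "--id", f"{master}_vdj",
                "--reference", vdj_reference,
                "--fastqs", row["fastq_dir"],
                "--sample", row["sample"],
            ])
        elif master not in count_masters:
            count_masters.append(master)

    # One count job per master, covering all of its non-VDJ libraries.
    count_commands = [[
        "cellranger", "count",
        "--id", master,
        "--transcriptome", transcriptome,
        "--libraries", f"{master}_libraries.csv",
        "--feature-ref", feature_ref,
    ] for master in count_masters]
    return count_commands + vdj_commands


def run_in_parallel(commands, max_workers=2):
    """Run the independent jobs concurrently and return their exit codes."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))


rows = [
    {"sample": "S1_gex", "fastq_dir": "/path/to/fastqs/gex",
     "master": "Sample1", "library_type": "Gene Expression"},
    {"sample": "S1_vdj", "fastq_dir": "/path/to/fastqs/vdj",
     "master": "Sample1", "library_type": "vdj"},
]
for cmd in plan_commands(rows, "/path/to/reference",
                         "feature_reference.csv", "/path/to/vdj_reference"):
    print(" ".join(cmd))
```

Because the count and vdj jobs share no inputs beyond the fastq directories, they can be submitted side by side (`run_in_parallel` above, or separate cluster jobs), which is what keeps the total processing time acceptable.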
See a rough sketch of the proposal. Apologies for the low quality output from MIRO.