Workflow sample sheet discussion

marrip commented 1 year ago

Hey guys,

I thought I would start off with this discussion already so people can think about it a bit. This sample sheet is not the one used for the sequencing run but the one specifically for the Nextflow workflow. I could imagine basing it on the TSOPPI sample sheet we use in Bergen, adding relevant and removing unnecessary information. I could imagine something like this:

patient_id	sample_id	type	tumor_content	tumor_site	run_id
IPH00124	IPH00124-D03-T01-A19_P	T	0.9	19	200501_NS500643_0793_AHG5MGBGYM
IPH00124	IPH00124-D01-N01-A19_P	N			200501_NS500643_0793_AHG5MGBGYM
IPH00124	IPH00124-R05-T01-A19_P	R			200501_NS500643_0793_AHG5MGBGYM

To link the different sample types together we use the first prefix of the sample_id, and then we specify the type and run_id of each sample as well as the tumor_content and tumor_site. For tumor_site, I am not sure if we could just extract it from the sample_id (A19 in this case). In our sample sheet we also have an option to set the output names for the samples. I am wondering if this is something we want, but if yes, it is rather easy to add that to this file. Anything you guys need? Looking forward to getting some feedback ☺️

gertrudeln commented 1 year ago

From the Sample ID nomenclature doc: if a sample ID is named following this nomenclature, then I think yes, we can extract both of the following fields from it:

Sample type code (type)
The two-digit code for tumor site (tumor_site)

As for Bergen Haukeland, I think the proposed file (workflow sample sheet) has all the data necessary to start the TSOPPI part of the pipeline. Fields that are "substrings" of the other fields can be extracted by the workflow.

About the option to set the output names for the samples: we have actually never used this option in Bergen (the field is always set to NA).

Waiting to hear from other sites to know if this file actually covers all their input needs for the TSOPPI part :)

tinavisnovska commented 1 year ago

We have some samples going through the pipeline which do not follow the InPreD nomenclature strictly, and it would be great if the pipeline worked for them too, as such samples are added to InPreD samples for sequencing (and thus go through processing together). The only difference in the sample names is in the patient part (PPPyyyy): sometimes the character part is longer than 3 letters and sometimes the numerical part is shorter than 4 digits. Other than that, all our sample names contain the -Ann-Bpq-Cll part exactly as described in the InPreD nomenclature, so I think it would be really easy to accommodate such slightly different samples (string splitting on dashes should do the job...).
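The dash-splitting idea from this comment could be sketched as follows. This is only an illustration: the function name `parse_sample_id` is hypothetical, and the positions of the molecule letter, sample type letter and tumor site digits are inferred from the single example ID in this thread (IPH00124-D03-T01-A19_P), not from the nomenclature document itself.

```python
def parse_sample_id(sample_id: str) -> dict:
    """Split an InPreD-style sample ID on dashes and pull out the embedded fields.

    Assumption: the ID looks like <patient>-<Ann>-<Bpq>-<Cll>, optionally
    followed by "_P", as in the example IPH00124-D03-T01-A19_P.
    """
    # Drop an optional trailing "_P" (the sample pair suffix discussed here).
    core = sample_id[:-2] if sample_id.endswith("_P") else sample_id
    parts = core.split("-")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"expected 4 non-empty dash-separated parts: {sample_id!r}")
    patient, molecule_part, type_part, site_part = parts
    return {
        # No assumption about how many letters or digits the patient part has.
        "patient_id": patient,
        # Assumed: first letter of the second part, e.g. "D" or "R".
        "molecule": molecule_part[0],
        # Assumed: first letter of the third part, e.g. "T" or "N".
        "sample_type": type_part[0],
        # Assumed: everything after the leading letter of the fourth part.
        "tumor_site": site_part[1:],
    }
```

Because the patient part is taken as "whatever comes before the first dash", IDs with a longer letter prefix or a shorter numeric part would be handled the same way as strictly conforming ones.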

What does your _P at the end of sample_id stand for @marrip?

I'd be for removing patient_id, type and tumor_site columns as that information can be collected from sample_id - no need to store the same information multiple times, in my opinion.

There is one more column that we keep in such a file, and that is barcode (possible values are from UP01 to UP16). The barcode is assigned to a sample during sequencing. We use this information to generate the sequencing samplesheet, which is later provided to LocalApp to guide the analysis. If the sequencing sample sheet is provided as an input at other nodes, then the barcode might be optional: if it is present, the sequencing samplesheet will be generated, otherwise the path to an input sequencing samplesheet will be provided differently. Alternatively, maybe we can all switch to providing only sample names and barcodes and generating the sequencing samplesheet in the pipeline, if people like that idea. Do we even use the same barcode naming and sequencing samplesheets, or do we have some differences there as well? ;)

As for specifying outputs, I do not have strong opinions on that, but maybe we can adjust as we go, if needed.

marrip commented 1 year ago

haha, I have no idea, really. @gertrudeln do you know what the _P is intended for?

I like that you want to remove as much as possible, @tinavisnovska. 👍 We can remove any information from the sample sheet that is already present in the sample_id. I just want to make sure that the first prefix in sample_id is always the same for different sample types etc.

Good point! If you generate the sample sheet from this information, the barcode should be included. We can set up the pipeline to check for an existing sample sheet or generate one. I am also open to the idea of generating it as part of the pipeline, good idea 🙂. Not sure about the barcode naming. I can check for HUS and maybe everyone can check for their node.

Yes, outputs is a different story, let's start at the beginning and discuss any output we might need when we come to the individual processes ☺️

gertrudeln commented 1 year ago

About the _P, I cannot say with absolute certainty what it is intended for! We kept it there to keep the code as close to the original script as possible.

In the original bash script that we got from @danielvo, there is a for loop that creates a template for starting TSOPPI DNA post-processing. In this loop positional arguments are used, and one of them (Cx) is defined as the sample pair ID of the tumor DNA sample. In the original bash script all example entries for this argument end in _P. We just kept this "convention" of appending the _P as it was in the original bash script, e.g. IPH00124-D03-T01-A19_P.

Perhaps Daniel can shed some light on the _P ...

danielvo commented 1 year ago

Hello everyone,

Thanks for starting the discussion (and many apologies for my late contribution)!

If all sample IDs strictly adhere to the InPreD sample ID nomenclature, we could indeed parse patient ID, sample type and tumor site information from the sample IDs. If TSOPPI and the Nextflow wrapper are ever used on samples generated outside of InPreD, the sample IDs might not contain the necessary information (or the formatting might be unexpected). I see three possible solutions for those cases: 1) keeping the current redundancy; 2) removing the current redundancy, with the wrapper potentially having multiple versions in the future; 3) removing the current redundancy and some future sample IDs potentially having to be renamed in order to fit the expectations we have for InPreD IDs now. I think option 1) is the most flexible and perhaps doesn't require much additional work for us (?), but please let me know if you think some other way forward would be better.
I assume the sample sheet creation/barcode usage depends on both library preparation and the sequencing process, which might differ slightly at different nodes (e.g., multiplexing 8 samples with a single sample index on a NextSeq machine vs. multiplexing 16 samples with dual indexes on a NovaSeq machine). I know we have gone through a few different setups in Oslo, and we currently create LocalApp sample sheets by combining "a sequencing sample sheet precursor" from the lab with Illumina sample index/barcode dictionary information. Having an option to create a sample sheet from scratch with the wrapper would be a plus, in my opinion, but I assume we'd have to take into account different possible scenarios here (perhaps using additional config files that would fit the different setups).
At OUS, we have used the option of choosing custom output sample IDs a couple of times. The sample IDs require a change sometimes, as some of the initially provided information might be incorrect (in case of a sample swap or in case of wrong tumor type/tumor site specification) and then one has 2 options: 1) changing the original sample sheet information, rerunning the LocalApp, and finally rerunning TSOPPI, or 2) rerunning TSOPPI with changed output sample IDs. The latter option is much faster if one is only interested in having TSOPPI output with the correct sample IDs (though I'd also recommend going through option 1) when the schedule finally allows it, in order to ensure that all files at all levels use consistent sample IDs).
The "_P" part of the nomenclature represents the change from "sample ID" to "sample pair ID". Originally, the LocalApp required a sample pair ID to be specified for all samples; it was meant to tie matched tumor DNA and tumor RNA samples together. The only change it meant for the output, however, was changing the output directory structure and merging a few overview files (not adding any new information). As we didn't like the alternative directory structures at OUS, we simply appended "_P" to all sample IDs, which created predictable obligatory sample pair IDs but didn't lead to any file/directory merges (as the matched tumor DNA and tumor RNA samples had different pair IDs then). The sample pair ID is now optional, I think (still allowing for the dual structures), and I believe AHUS uses sample pair IDs identical to sample IDs (?), but we stick to our original nomenclature at OUS.

Daniel

marrip commented 1 year ago

Hey Daniel,

thank you for your input! ☺️

If you want to allow for sample IDs not following the InPreD nomenclature, we should probably keep all required columns (type, tumor site, patient id).
We can make site specific config profiles in which we describe which settings to use to generate the SampleSheet for the LocalApp.
If it is necessary to keep the renaming option, we can just add a column output_id or similar.
I am not completely sure I understand. So the LocalApp does not analyse the tumor, normal and RNA samples of a single patient together? Or does it? As I understand your description, it is keeping them separate. But why do we add _P? The sample IDs are different anyway, so wouldn't it be enough to just use them as the sample pair IDs? I feel like we should discuss this a bit, maybe in a Friday meeting.

danielvo commented 1 year ago

Hello Martin,

The LocalApp isn't able to utilize matched normal/control samples, the whole pipeline is meant for tumor-only analysis. If a normal sample is available however, it can be analyzed as a separate tumor DNA sample (at least it will be considered as such by the LocalApp) - TSOPPI then takes care of aggregating data from the matched samples. The situation is in a way similar for the matched RNA samples. RNA samples are run through a separate analysis pipeline within the LocalApp. The advantage of matched DNA and RNA samples sharing a sample pair ID is that variants from both samples will be present in the same file (rather than the same variants being split among two files). It's similar with the LocalApp metrics output: the output will be formatted depending on whether the sample pair IDs are identical for the matched DNA and RNA samples, but the information content doesn't change. I prefer having the results sample-wise in this situation. Adding "_P" to sample IDs was a simple way to follow the recommendations of deriving sample pair IDs from sample IDs; I don't think I even considered using the sample ID values for the sample pair ID parameter as well. Tina, is it so that you use identical sample ID and sample pair IDs at AHUS? I suppose we could make that the default for InPreD if that is possible and desirable.

marrip commented 1 year ago

Ah, then I understand. Thank you, Daniel, for this detailed description! So we do not need the coupling for the LocalApp. I agree, making the sample_id the default sample_pair_id is probably the easiest way if there are no arguments against using it as is. 👌

tinavisnovska commented 1 year ago

@danielvo: yes, samplesheets at Ahus use the sample ID value as the sample pair ID as well (I just took one of the samplesheets created by Torben and made a script creating an identical type of samplesheet without too much understanding of what the pair ID is meant to represent... ;)) and it seems to be working fine, so I think we can drop _P from sample_id and use that value as sample_pair_id.
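For sites that still carry the legacy "_P" suffix, the change agreed on here could be handled with a small helper like the one below. It is a sketch only; the name `normalize_ids` is hypothetical and nothing in the thread prescribes this interface.

```python
def normalize_ids(raw_id: str, legacy_suffix: str = "_P") -> tuple:
    """Return (sample_id, sample_pair_id) with the pair ID equal to the sample ID.

    A trailing legacy suffix (assumed to be "_P") is removed first, so old and
    new style IDs end up identical.
    """
    if legacy_suffix and raw_id.endswith(legacy_suffix):
        sample_id = raw_id[: -len(legacy_suffix)]
    else:
        sample_id = raw_id
    # Per the discussion, the sample ID is reused as the sample pair ID.
    return sample_id, sample_id
```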

@danielvo: you are right that if samplesheets are generated by this workflow, we need to accommodate for creating all the types of samplesheets which are in use at the moment (or will be used in the future). To do that, we need to collect them first, I would say.

@danielvo @marrip: yes, I think you are right that expecting all samples ever to follow the InPreD nomenclature is restrictive. However, if we want to keep some of the info from the nomenclature redundant in the table, then I would suggest keeping redundant all the info used in the analysis and postprocessing. That would mean: keep patient_id; split type into two columns, molecule (DNA/RNA) and type (tumor/normal), which seems clearer to me as the type column now mixes the two pieces of information together; keep tumor_content, tumor_type, and run_id.

tinavisnovska commented 1 year ago

Also with the redundancy in this sample sheet, I feel that it might be prone to typos - it would be nice to have some functionality to check for consistency in case the sample_id follows InPreD nomenclature.
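A consistency check of this kind might look like the sketch below. It assumes the row has already been identified as following the InPreD nomenclature, that column names match the ones proposed in this thread, and that the letter and digit positions inside the ID are as in the example IPH00124-D03-T01-A19. The function name `check_row_consistency` is made up for illustration.

```python
def check_row_consistency(row: dict) -> list:
    """Compare redundant columns against what the sample_id implies.

    Returns a list of human-readable problems; an empty list means consistent.
    Raises ValueError if sample_id does not have four dash-separated parts.
    """
    patient, molecule_part, type_part, site_part = row["sample_id"].split("-")
    expected = {
        "patient_id": patient,
        "molecule": molecule_part[0],      # assumed: "D" or "R"
        "sample_type": type_part[0],       # assumed: "T" or "N"
        "tumor_site": site_part[1:],       # assumed: digits after the letter
    }
    problems = []
    for column, want in expected.items():
        got = str(row.get(column) or "").strip()
        if not got:
            continue  # an empty column cannot contradict the ID
        if column in ("molecule", "sample_type"):
            # Compare on the first letter so "dna"/"D" or "tumor"/"T" both pass.
            matches = got[0].upper() == want.upper()
        else:
            matches = got == want
        if not matches:
            problems.append(f"{column}: sheet has {got!r}, sample_id implies {want!r}")
    return problems
```

Returning a list of messages instead of raising on the first mismatch makes it possible to report every typo in a sheet in one pass.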

danielvo commented 1 year ago

Good feedback! Let us

1) use the "sample_id" as "sample_pair_id" (we will deal with the change somehow at OUS and HUS, it might also apply to Trondheim); 2) collect examples of the different sample sheets used within InPreD and tie any identified differences between them to the corresponding technical differences (in the end having one sample sheet per specific library/sequencing setup); 3) keep all the details in the workflow sample sheet (including separate columns for molecule- and sample type) and add also a "sample_id_format" column (we can call the current InPreD nomenclature "inpred_v1" format for example); we can then have parsers for recognized ID formats, such as "inpred_v1", ensuring that we can avoid redundancy when preparing these workflow sample sheets if possible (e.g., in our case, many column values could be parsed from the sample ID).

What is the idea regarding the matching of paired samples? This might be an issue when there are multiple samples of a given type for the patient available. We could also add columns for IDs of preferred matched samples (those could be left empty [with "NA"] when not relevant).

marrip commented 1 year ago

1. I agree.
2. Sounds like a good idea to me!
3. I feel like we should somehow depict things graphically to see what goes in and what is dependent on what. Not sure we need to specify the nomenclature if we can just use regexes to match the right one, and if it doesn't match, we assume that the other columns (tumor_site, tumor_content, etc.) are present.
4. This is also new information for me. So a single run might contain several samples of the same type? The question arises: why do you sequence several of them? Or is it samples from previous runs you want to match? There is a lot to take into account, it seems, and I feel you guys have a better understanding of what is needed. Would you have the possibility to specify these things more clearly, considering all possible routes through the pipeline? Some kind of flow chart would probably be beneficial 🙂
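As an illustration of the regex idea raised in this comment (match the ID against a known pattern, otherwise require the explicit columns), here is a minimal sketch. The pattern `INPRED_V1` is inferred from the example IDs in this thread and is an assumption, not the official nomenclature; `resolve_fields` is a hypothetical helper name.

```python
import re

# Assumed shape: letters+digits patient part, then three dash-separated parts
# consisting of one capital letter followed by two digits each.
INPRED_V1 = re.compile(
    r"^(?P<patient_id>[A-Za-z]+\d+)"
    r"-(?P<molecule>[A-Z])\d{2}"
    r"-(?P<sample_type>[A-Z])\d{2}"
    r"-[A-Z](?P<tumor_site>\d{2})$"
)

FIELDS = ("patient_id", "molecule", "sample_type", "tumor_site")


def resolve_fields(row: dict) -> dict:
    """Take fields from the sample_id when it matches, else from the columns."""
    explicit = {f: str(row.get(f) or "").strip() for f in FIELDS}
    match = INPRED_V1.match(row["sample_id"])
    if match:
        parsed = match.groupdict()
        # Explicit column values win; parsed values fill the gaps.
        return {f: explicit[f] or parsed[f] for f in FIELDS}
    missing = [f for f in FIELDS if not explicit[f]]
    if missing:
        raise ValueError(
            f"{row['sample_id']!r} does not match the assumed pattern and "
            f"these columns are empty: {', '.join(missing)}"
        )
    return explicit
```

With this approach a separate format column would not be strictly required, at the cost of silently treating any near-miss ID as "unknown format".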

danielvo commented 1 year ago

Regarding point 4, sequencing matched samples in different runs is a common occurrence for us at OUS. Having multiple samples of the same type and patient in the same run isn't as frequent, but we have certainly sequenced the primary tumor and metastasis together before (those could be matched with the same normal, if available, and with either the same or different RNA samples). We can discuss this further.
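One way to think about the matching problem described here is sketched below. It assumes normalized `molecule` and `sample_type` values ("dna", "tumor", "normal") and a hypothetical optional column `matched_normal_id` for the preferred match, left as "NA" when not relevant. None of these names are settled in the thread, and RNA matching is left out for brevity.

```python
from collections import defaultdict


def match_normals(rows: list) -> dict:
    """Map each tumor DNA sample_id to a normal sample_id, or None.

    Order of preference: an explicit matched_normal_id, then the only normal
    available for the patient. None means no normal or an ambiguous choice.
    """
    normals_by_patient = defaultdict(list)
    for row in rows:
        if row["molecule"] == "dna" and row["sample_type"] == "normal":
            normals_by_patient[row["patient_id"]].append(row["sample_id"])

    matches = {}
    for row in rows:
        if row["molecule"] != "dna" or row["sample_type"] != "tumor":
            continue
        preferred = (row.get("matched_normal_id") or "NA").strip()
        candidates = normals_by_patient.get(row["patient_id"], [])
        if preferred not in ("", "NA"):
            matches[row["sample_id"]] = preferred
        elif len(candidates) == 1:
            matches[row["sample_id"]] = candidates[0]
        else:
            matches[row["sample_id"]] = None  # nothing to match, or ambiguous
    return matches
```

In the primary tumor plus metastasis case mentioned above, both tumor samples would be paired with the same normal as long as the patient has exactly one normal in the sheet.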

marrip commented 1 year ago

Do you think we should have an extra meeting to discuss this, or should we simply discuss it in one of the Friday meetings? I would opt for the former, as I feel not everybody might be interested in this.

gertrudeln commented 1 year ago

I agree, a meeting dedicated to this would be very helpful :)

tinavisnovska commented 1 year ago

I can try to make similar scripts to the one I have for our NextSeq and 16 barcodes setup, if you throw some examples of your sequencing sample sheets at me.
Here is my attempt to sum up where the info from this sample sheet will be used in the workflow:

column	description
`patient_id`	- used to collect all samples related to one patient for the report, but as mentioned before, does not work when multiple samples of one type are present in the primary seq run - the one that goes into LocalApp
`sample_id`	- aggregation of most of the other values or something else
`sample_id_format`	- inpred_v1, used to decide whether `sample_id` is aggregation of other values or not; if yes, check for consistency of the table info across columns.
`molecule`	- goes into sequencing samplesheet -> LocalApp, goes into scripts that execute TSOPPI
`sample_type`	- goes into scripts that execute TSOPPI
`tumor_content`	- goes into scripts that execute TSOPPI
`tumor_site`	- goes into scripts that execute TSOPPI
`barcode`	- goes into sequencing samplesheet -> LocalApp
`run_id`	- goes into scripts that execute TSOPPI

nice to have: hash to comment out lines in the file

I used to comment out some samples, leave others, and run the TSOPPI part multiple times to generate the required TSOPPI/report outputs in such a case. Very much hands-on, not optimal at all, but I did not manage to put anything better in place. I am all ears for ideas about how to deal with such situations.
Separate meeting sounds like a good idea!
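The "hash to comment out lines" wish from this comment is simple to support when reading the sheet. A minimal sketch using Python's standard `csv` module, assuming a tab-separated file with a header row (the function name `read_sample_sheet` is hypothetical):

```python
import csv


def read_sample_sheet(lines) -> list:
    """Read a tab-separated sample sheet, skipping blank and '#' lines.

    `lines` can be an open file handle or any iterable of strings.
    Returns a list of dicts keyed by the header row.
    """
    kept = (
        line
        for line in lines
        if line.strip() and not line.lstrip().startswith("#")
    )
    return list(csv.DictReader(kept, delimiter="\t"))
```

Usage would be along the lines of `with open(path, newline="") as handle: rows = read_sample_sheet(handle)`. Note that a sample ID that itself begins with `#` would be skipped, which seems an acceptable restriction.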

marrip commented 1 year ago

To sum up, our Nextflow sample sheet will look like this:

dataset_id	sample_id	molecule	sample_type	tumor_site	tumor_content	run_id	barcode
IPH0001	IPH0001-D01-T01-A19	dna	tumor	19	0.5	200501_NS500643_0793_AHG5MGBGYM	UDP0029

Another question, what do we allow in molecule and sample_type?

molecule	possible
dna	D, d, DNA, DNa, Dna, DnA, dNA, dna
rna	R, r, RNA, RNa, Rna, RnA, rNA, rna

Should we only allow a single capital letter, all of these possibilities, or something else?

sample_type	possible
tumor	T, t, tumor, Tumor, etc.
normal	N, n, normal, Normal, etc.

Again, just a single capital letter maybe, or what do you deem best?

tinavisnovska commented 1 year ago

Thanks @marrip for putting the updated nextflow sample sheet together!

as for molecule and sample_type - I can imagine checking for all of the mentioned values pretty easily, so maybe we should do that for the convenience,... things would get a bit trickier with "tumour" but still possible.
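Accepting all the listed spellings comes down to a case-insensitive lookup. The sketch below is one possible way to do it; the alias tables and the `normalize` helper are illustrative, and "tumour" is included as the extra spelling mentioned in this comment.

```python
MOLECULE_ALIASES = {"d": "dna", "dna": "dna", "r": "rna", "rna": "rna"}
SAMPLE_TYPE_ALIASES = {
    "t": "tumor",
    "tumor": "tumor",
    "tumour": "tumor",
    "n": "normal",
    "normal": "normal",
}


def normalize(value: str, aliases: dict, column: str) -> str:
    """Map any accepted spelling to its canonical lower-case form."""
    key = value.strip().lower()
    if key not in aliases:
        allowed = ", ".join(sorted(set(aliases.values())))
        raise ValueError(f"{column}: {value!r} is not one of {allowed} (or an alias)")
    return aliases[key]
```

Lower-casing before the lookup covers every capitalization variant in the tables above without listing them individually.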

however, two more comments:

regarding tumor_content: 1. NA/unknown/empty: at least one of these should be allowed, in which case data for 100% and 50% will be generated when TSOPPI runs (this is the default TSOPPI behaviour when no tumor content is provided). 2. we historically provide tumor_content in the 0-100 range, not 0-1. This can surely be adjusted, but maybe we can ask others what would make more sense for their wet labs to report.
regarding barcode: an empty string should also be allowed, in which case the column index should contain the index sequence used for the sample.
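Both comments could be handled at validation time, roughly as sketched below. The function names, the `scale` parameter and the set of "missing" markers are assumptions for illustration. The scale is passed explicitly instead of guessed, since a value such as 1 is ambiguous between the 0-1 and 0-100 conventions.

```python
MISSING_MARKERS = {"", "na", "unknown"}


def parse_tumor_content(raw: str, scale: str = "percent"):
    """Return tumor content as a fraction in [0, 1], or None if not provided.

    scale="percent" expects 0-100 input, scale="fraction" expects 0-1 input.
    What happens downstream when None is returned is left to the caller.
    """
    text = raw.strip()
    if text.lower() in MISSING_MARKERS:
        return None
    value = float(text)
    if scale == "percent":
        value /= 100.0
    elif scale != "fraction":
        raise ValueError(f"unknown scale: {scale!r}")
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"tumor_content out of range: {raw!r}")
    return value


def barcode_or_index(row: dict) -> tuple:
    """Return ("barcode", value) if set, else ("index", value); fail if neither."""
    barcode = str(row.get("barcode") or "").strip()
    index = str(row.get("index") or "").strip()
    if barcode:
        return ("barcode", barcode)
    if index:
        return ("index", index)
    raise ValueError("either barcode or index must be provided")
```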

InPreD / tso500_nxf_workflow
