InPreD / tso500_nxf_workflow

Nextflow workflow to run Illumina LocalApp and TSOPPI on TSO500 data
MIT License

Workflow sample sheet discussion #1

Open marrip opened 1 year ago

marrip commented 1 year ago

Hey guys,

I thought I would start off with this discussion already so people can think about it a bit. This sample sheet is not the one used for the sequencing run but the one specifically for the Nextflow workflow. I could imagine basing it on the TSOPPI sample sheet we use in Bergen and adding relevant or removing unnecessary information. I could imagine something like this:

| patient_id | sample_id | type | tumor_content | tumor_site | run_id |
| --- | --- | --- | --- | --- | --- |
| IPH00124 | IPH00124-D03-T01-A19_P | T | 0.9 | 19 | 200501_NS500643_0793_AHG5MGBGYM |
| IPH00124 | IPH00124-D01-N01-A19_P | N | | | 200501_NS500643_0793_AHG5MGBGYM |
| IPH00124 | IPH00124-R05-T01-A19_P | R | | | 200501_NS500643_0793_AHG5MGBGYM |

To link the different sample types together we use the first prefix of the sample_id, and then we specify the type and run_id of each sample as well as the tumor_content and tumor_site. For tumor_site, I am not sure if we could just extract it from the sample_id (A19 in this case). In our sample sheet we also have an option to set the output names for the samples. I am wondering if this is something we want, but if yes, it is rather easy to add that to this file. Anything you guys need? Looking forward to getting some feedback ☺️

gertrudeln commented 1 year ago

According to the Sample ID nomenclature doc, if a sample ID is named following this nomenclature, then I think yes, we can extract both of the following fields from it:

As for Bergen Haukeland, I think the proposed file (workflow sample sheet) has all the data necessary to start the TSOPPI part of the pipeline. Fields that are "substrings" of the other fields can be extracted by the workflow.

About the option to set the output names for the samples: we have actually never used this option in Bergen (the field is always set to NA).

Waiting to hear from other sites to know if this file actually covers all their input needs for the TSOPPI part :)

tinavisnovska commented 1 year ago

We have some samples going through the pipeline which do not follow the InPreD nomenclature strictly, and it would be great if the pipeline worked for them too, as such samples are added to InPreD samples for sequencing (and thus go through processing together). The only difference in the sample names is in the patient part (PPPyyyy): sometimes the character part is longer than 3 letters and sometimes the numerical part is shorter than 4 digits. Other than that, all our sample names contain the -Ann-Bpq-Cll part exactly as described in the InPreD nomenclature, so I think it would be really easy to accommodate such slightly different samples (splitting the string on dashes should do the job...).
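A rough sketch of the dash-splitting idea, for illustration only and not actual workflow code. The function name `split_sample_id` and the keys of the returned dictionary are made up here. It assumes that the last three dash-separated fields are always the -Ann-Bpq-Cll part, so whatever comes before them is treated as the patient part regardless of its length.

```python
def split_sample_id(sample_id: str) -> dict:
    """Split a sample ID on dashes, treating the last three fields as the fixed part."""
    # rsplit from the right so that an unusual patient part does not shift the fields.
    parts = sample_id.rsplit("-", 3)
    if len(parts) != 4:
        raise ValueError(f"expected at least four dash-separated fields: {sample_id!r}")
    patient, material, sample, site = parts
    return {"patient": patient, "material": material, "sample": sample, "site": site}


# A strictly named ID and a hypothetical one with a longer/shorter patient part.
print(split_sample_id("IPH00124-D03-T01-A19"))
print(split_sample_id("LONGNAME12-R05-T01-A19"))
```

Note that a trailing suffix such as _P would end up in the last field with this approach, so it would have to be stripped beforehand.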

What does your _P at the end of sample_id stand for @marrip?

I'd be for removing patient_id, type and tumor_site columns as that information can be collected from sample_id - no need to store the same information multiple times, in my opinion.

There is one more column that we keep in such a file and that is barcode (possible values are from UP01 to UP16). The barcode is assigned to a sample during sequencing. We use this information to generate the sequencing samplesheet, which is later provided to LocalApp to guide the analysis. If the sequencing sample sheet is provided as an input at other nodes, then the barcode might be optional: if it is present, the sequencing samplesheet will be generated, otherwise the path to an input sequencing samplesheet will be provided differently. Alternatively, maybe we can all switch to providing only sample names and barcodes and generating the sequencing samplesheet in the pipeline, if people like that idea. Do we even use the same barcode naming and sequencing samplesheets or do we have some differences there as well? ;)

As for specifying outputs, I do not have strong opinions on that, but maybe we can adjust as we go, if needed.

marrip commented 1 year ago

haha, I have no idea, really. @gertrudeln do you know what the _P is intended for?

I like that you want to remove as much as possible, @tinavisnovska. 👍 We can remove any information from the sample sheet that is already present in the sample_id. I just want to make sure that the first prefix in sample_id is always the same for different sample types etc.

Good point! If you generate the sample sheet from the information here, barcode should be included. We can set up the pipeline to check for an existing sample sheet or generate one. I am also open to the idea of generating it as part of the pipeline, good idea 🙂. Not sure about the barcode naming. I can check for HUS and maybe everyone can check for their node.

Yes, outputs is a different story, let's start at the beginning and discuss any output we might need when we come to the individual processes ☺️

gertrudeln commented 1 year ago

About the _P, I cannot say with absolute certainty what it is intended for! We kept it there to keep the code as close to the original script as possible.

In the original bash script that we got from @danielvo, there is a for loop that creates a template for starting TSOPPI DNA post-processing. In this loop positional arguments are used, and one of them (Cx) is defined as the sample pair ID of the tumor DNA sample. In the original bash script all example entries for this argument end in _P. We just kept this "convention" of appending the _P as it was in the original bash script, e.g. IPH00124-D03-T01-A19_P

Perhaps Daniel can shed some light on the _P ...

danielvo commented 1 year ago

Hello everyone,

Thanks for starting the discussion (and many apologies for my late contribution)!

Daniel

marrip commented 1 year ago

Hey Daniel,

thank you for your input! ☺️

danielvo commented 1 year ago

Hello Martin,

The LocalApp isn't able to utilize matched normal/control samples; the whole pipeline is meant for tumor-only analysis. If a normal sample is available, however, it can be analyzed as a separate tumor DNA sample (at least it will be considered as such by the LocalApp), and TSOPPI then takes care of aggregating data from the matched samples.

The situation is in a way similar for the matched RNA samples. RNA samples are run through a separate analysis pipeline within the LocalApp. The advantage of matched DNA and RNA samples sharing a sample pair ID is that variants from both samples will be present in the same file (rather than the same variants being split between two files). It's similar with the LocalApp metrics output: the output will be formatted depending on whether the sample pair IDs are identical for the matched DNA and RNA samples, but the information content doesn't change. I prefer having the results sample-wise in this situation.

Adding "_P" to sample IDs was a simple way to follow the recommendation of deriving sample pair IDs from sample IDs; I don't think I even considered using the sample ID values for the sample pair ID parameter as well. Tina, is it so that you use identical sample IDs and sample pair IDs at AHUS? I suppose we could make that the default for InPreD if that is possible and desirable.

marrip commented 1 year ago

Ah, then I understand. Thank you, Daniel, for this detailed description! So we do not need the coupling for the LocalApp. I agree, making the sample_id the default sample_pair_id is probably the easiest way if there are no arguments against using it as is. 👌

tinavisnovska commented 1 year ago

@danielvo: yes, samplesheets at Ahus use the sample ID value as the sample pair ID as well (I just took one of the samplesheets created by Torben and made a script creating an identical type of samplesheet without too much understanding of what the pair ID is meant to represent... ;)) and it seems to be working fine, so I think we can drop _P from sample_id and use that value as sample_pair_id.

@danielvo: you are right that if samplesheets are generated by this workflow, we need to accommodate for creating all the types of samplesheets which are in use at the moment (or will be used in the future). To do that, we need to collect them first, I would say.

@danielvo @marrip: yes, I think you are right that expecting all samples ever to follow the InPreD nomenclature is too restrictive. However, if we want to keep some of the info from the nomenclature redundantly in the table, then I would suggest keeping all the info used in the analysis and postprocessing. That would mean: keep patient_id, split type into two columns, molecule (DNA/RNA) and type (tumor/normal), which seems clearer to me as the type column currently mixes the two pieces of information together, and keep tumor_content, tumor_type, and run_id.

tinavisnovska commented 1 year ago

Also with the redundancy in this sample sheet, I feel that it might be prone to typos - it would be nice to have some functionality to check for consistency in case the sample_id follows InPreD nomenclature.
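Such a consistency check could look something like the sketch below. It is only an illustration under several assumptions taken from the example IDs in this thread: the letter of the second field gives the molecule (D/R), the letter of the third field gives the type (T/N), and the digits of the last field give the tumor site. The function name `find_inconsistencies` and the exact column values ("DNA", "tumor", ...) are hypothetical.

```python
def find_inconsistencies(row: dict) -> list:
    """Compare redundant columns with what the sample_id implies; return mismatch messages."""
    # Assumes the ID follows the nomenclature, i.e. has four dash-separated fields.
    patient, material, sample, site = row["sample_id"].rsplit("-", 3)
    derived = {
        "patient_id": patient,
        "molecule": {"D": "DNA", "R": "RNA"}.get(material[:1]),
        "type": {"T": "tumor", "N": "normal"}.get(sample[:1]),
        "tumor_site": site[1:],
    }
    problems = []
    for column, expected in derived.items():
        given = row.get(column, "")
        # Empty cells are skipped: only values that are present can contradict the ID.
        if given and given != expected:
            problems.append(f"{column}: table has {given!r}, sample_id implies {expected!r}")
    return problems


row = {
    "patient_id": "IPH00124",
    "sample_id": "IPH00124-D03-T01-A19",
    "molecule": "DNA",
    "type": "tumor",
    "tumor_site": "91",  # deliberate typo
}
for problem in find_inconsistencies(row):
    print(problem)
```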

danielvo commented 1 year ago

Good feedback! Let us

1) use the "sample_id" as "sample_pair_id" (we will deal with the change somehow at OUS and HUS, it might also apply to Trondheim);
2) collect examples of the different sample sheets used within InPreD and tie any identified differences between them to the corresponding technical differences (in the end having one sample sheet per specific library/sequencing setup);
3) keep all the details in the workflow sample sheet (including separate columns for molecule- and sample type) and add also a "sample_id_format" column (we can call the current InPreD nomenclature "inpred_v1" format for example); we can then have parsers for recognized ID formats, such as "inpred_v1", ensuring that we can avoid redundancy when preparing these workflow sample sheets if possible (e.g., in our case, many column values could be parsed from the sample ID).
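To make point 3 a bit more concrete, here is a minimal sketch of per-format parsers. Everything in it is an assumption for illustration: the regular expression is guessed from the example IDs in this thread (not taken from the nomenclature document), and the names `PARSERS`, `parse_inpred_v1` and `fill_from_sample_id` are hypothetical.

```python
import re

# Assumed pattern for "inpred_v1", based only on the example IDs in this thread.
INPRED_V1 = re.compile(
    r"^(?P<patient>[A-Za-z]+\d+)-(?P<molecule>[DR])\d{2}"
    r"-(?P<sample_type>[TN])\d{2}-[A-Z](?P<site>\d{2})$"
)


def parse_inpred_v1(sample_id):
    """Derive column values from an ID that matches the assumed inpred_v1 pattern."""
    match = INPRED_V1.match(sample_id)
    if match is None:
        raise ValueError(f"not an inpred_v1 sample ID: {sample_id!r}")
    return {
        "patient_id": match["patient"],
        "molecule": {"D": "DNA", "R": "RNA"}[match["molecule"]],
        "sample_type": {"T": "tumor", "N": "normal"}[match["sample_type"]],
        "tumor_site": match["site"],
    }


# One parser per recognized value of the sample_id_format column.
PARSERS = {"inpred_v1": parse_inpred_v1}


def fill_from_sample_id(row):
    """Return a copy of row with empty or "NA" columns filled from the sample ID."""
    parser = PARSERS.get(row.get("sample_id_format", ""))
    if parser is None:
        return dict(row)  # unrecognized format: keep the row exactly as written
    filled = dict(row)
    for column, value in parser(row["sample_id"]).items():
        if filled.get(column, "") in ("", "NA"):
            filled[column] = value
    return filled


row = {"sample_id": "IPH00124-D03-T01-A19", "sample_id_format": "inpred_v1", "tumor_content": "0.9"}
print(fill_from_sample_id(row))
```

With this layout, values that are already in the table are never overwritten; only missing ones are derived, so samples with other ID formats can still be described fully by hand.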

What is the idea regarding the matching of paired samples? This might be an issue when there are multiple samples of a given type for the patient available. We could also add columns for IDs of preferred matched samples (those could be left empty [with "NA"] when not relevant).

marrip commented 1 year ago

  1. I agree.
  2. Sounds like a good idea to me!
  3. I feel like we should somehow depict things graphically to see what goes in and what is dependent on what. Not sure we need to specify the nomenclature if we can just use regexes to match the right one, and if it doesn't match we assume that the other columns (tumor_site, tumor_content, etc.) are present.
  4. This is also new information for me. So a single run might contain several samples of the same type? The question arises: why do you sequence several of them? Or is it samples from previous runs you want to match? There is a lot to take into account, it seems, and I feel you guys have a better understanding of what is needed. Would you be able to specify these things more clearly, considering all possible routes through the pipeline? Some kind of flow chart would probably be beneficial 🙂

danielvo commented 1 year ago

Regarding point 4, sequencing matched samples in different runs is a common occurrence for us at OUS. Having multiple samples of the same type and patient in the same run isn't as frequent, but we have certainly sequenced the primary tumor and metastasis together before (those could be matched with the same normal, if available, and with either the same or different RNA samples). We can discuss this further.
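The earlier suggestion of columns for preferred matched samples could work roughly as in the sketch below, shown here for the matched normal only. This is an assumption-laden illustration: the column name `matched_normal_id` and the function `pick_matched_normal` are hypothetical, and the rule "pair automatically only if exactly one normal exists for the patient" is just one possible choice.

```python
def pick_matched_normal(tumor, rows):
    """Return the sample_id of the normal to pair with this tumor sample, or None."""
    preferred = tumor.get("matched_normal_id", "NA")
    if preferred not in ("", "NA"):
        return preferred  # an explicit choice in the sheet always wins
    candidates = [
        r["sample_id"]
        for r in rows
        if r["patient_id"] == tumor["patient_id"] and r["sample_type"] == "normal"
    ]
    # Only pair automatically when the choice is unambiguous.
    return candidates[0] if len(candidates) == 1 else None


# Primary tumor and metastasis of one patient sharing a single normal.
rows = [
    {"patient_id": "IPH00124", "sample_id": "IPH00124-D03-T01-A19", "sample_type": "tumor"},
    {"patient_id": "IPH00124", "sample_id": "IPH00124-D04-T02-A19", "sample_type": "tumor"},
    {"patient_id": "IPH00124", "sample_id": "IPH00124-D01-N01-A19", "sample_type": "normal"},
]
for row in rows:
    if row["sample_type"] == "tumor":
        print(row["sample_id"], "->", pick_matched_normal(row, rows))
```

The same lookup could be repeated with a second hypothetical column for the preferred RNA sample.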

marrip commented 1 year ago

Do you think we should have an extra meeting to discuss this, or should we simply discuss it in one of the Friday meetings? I would opt for the former, as I feel not everybody might be interested in this.

gertrudeln commented 1 year ago

I agree, a meeting dedicated to this would be very helpful :)

tinavisnovska commented 1 year ago

  1. I can try to make scripts similar to the one I have for our NextSeq and 16-barcode setup, if you throw some examples of your sequencing sample sheets at me.

  2. here is my attempt to sum up info about where the info from this sample sheet will be used in the workflow:

| column | description |
| --- | --- |
| patient_id | used to collect all samples related to one patient for the report, but as mentioned before, does not work when multiple samples of one type are present in the primary seq run (the one that goes into LocalApp) |
| sample_id | aggregation of most of the other values or something else |
| sample_id_format | inpred_v1, used to decide whether sample_id is an aggregation of other values or not; if yes, check for consistency of the table info across columns |
| molecule | goes into sequencing samplesheet -> LocalApp, goes into scripts that execute TSOPPI |
| sample_type | goes into scripts that execute TSOPPI |
| tumor_content | goes into scripts that execute TSOPPI |
| tumor_site | goes into scripts that execute TSOPPI |
| barcode | goes into sequencing samplesheet -> LocalApp |
| run_id | goes into scripts that execute TSOPPI |

nice to have: hash to comment out lines in the file
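Supporting hash comments is cheap if the sheet is filtered before parsing. The sketch below assumes a tab-separated file with a header row, which has not been decided in this thread; `read_sample_sheet` is a hypothetical helper name.

```python
import csv
import io


def read_sample_sheet(handle):
    """Read a tab-separated sample sheet, ignoring blank lines and lines starting with '#'."""
    lines = (
        line for line in handle
        if line.strip() and not line.lstrip().startswith("#")
    )
    return list(csv.DictReader(lines, delimiter="\t"))


# In-memory example; a real call would pass an open file instead.
example = io.StringIO(
    "sample_id\trun_id\n"
    "IPH00124-D03-T01-A19\t200501_NS500643_0793_AHG5MGBGYM\n"
    "#IPH00124-D01-N01-A19\t200501_NS500643_0793_AHG5MGBGYM\n"
)
rows = read_sample_sheet(example)
print([row["sample_id"] for row in rows])
```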

  1. I used to comment out some samples and leave others in, and run the TSOPPI part multiple times to generate the required TSOPPI/report outputs in such a case. It is very much hands-on and not optimal at all, but I did not manage to put anything better in place. I am all ears for ideas about how to deal with such situations.

  2. Separate meeting sounds like a good idea!

marrip commented 1 year ago

To sum up, our nextflow sample sheet will look like this:

| dataset_id | sample_id | molecule | sample_type | tumor_site | tumor_content | run_id | barcode |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IPH0001 | IPH0001-D01-T01-A19 | dna | tumor | 19 | 0.5 | 200501_NS500643_0793_AHG5MGBGYM | UDP0029 |

Another question, what do we allow in molecule and sample_type?

| molecule | possible |
| --- | --- |
| dna | D, d, DNA, DNa, Dna, DnA, dNA, dna |
| rna | R, r, RNA, RNa, Rna, RnA, rNA, rna |

Should we only allow a single capital letter, all possibilities, or something else?

| sample_type | possible |
| --- | --- |
| tumor | T, t, tumor, Tumor, etc. |
| normal | N, n, normal, Normal, etc. |

Again, just a single capital letter maybe, or what do you deem best?

tinavisnovska commented 1 year ago

Thanks @marrip for putting the updated nextflow sample sheet together!

As for molecule and sample_type: I can imagine checking for all of the mentioned values pretty easily, so maybe we should do that for convenience... Things would get a bit trickier with "tumour", but it is still possible.
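Accepting all of these spellings mostly comes down to lower-casing the value before looking it up. A minimal sketch, with hypothetical names (`MOLECULE_ALIASES`, `SAMPLE_TYPE_ALIASES`, `normalize`) and lower-case canonical values assumed to match the table above:

```python
# Lower-cased spellings mapped to one canonical value each.
MOLECULE_ALIASES = {"d": "dna", "dna": "dna", "r": "rna", "rna": "rna"}
SAMPLE_TYPE_ALIASES = {
    "t": "tumor", "tumor": "tumor", "tumour": "tumor",
    "n": "normal", "normal": "normal",
}


def normalize(value, aliases):
    """Map any accepted spelling to its canonical form; reject everything else."""
    key = value.strip().lower()
    if key not in aliases:
        raise ValueError(f"unrecognized value: {value!r}")
    return aliases[key]


print(normalize("DnA", MOLECULE_ALIASES))
print(normalize("Tumour", SAMPLE_TYPE_ALIASES))
```

Rejecting unknown values instead of guessing keeps typos from silently ending up in the wrong category.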

however, two more comments: