clintval / sample-sheet

Parse Illumina sample sheets with Python
https://sample-sheet.rtfd.io
MIT License

Duplicated Sample_ID #53

Closed: reisingerf closed this issue 6 years ago

reisingerf commented 6 years ago

Hi Clint,

we had an issue before (#32) where the same Sample_ID caused issues, even if the lanes were different.

Now we are having the same issue, but without specifying lanes. This time it's 10X data, which you mentioned briefly in the previous issue.

Basically, we would like to merge across lanes and indexes, so the corresponding sample sheets end up with the same Sample_ID on multiple Data lines.

An example:

Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Index_Plate_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
PRJ180538_VPH20T,,,,,SI-GA-G6_1,CTGACGCG,,,,
PRJ180538_VPH20T,,,,,SI-GA-G6_2,GGTCGTAC,,,,
PRJ180538_VPH20T,,,,,SI-GA-G6_3,TCCTTCTT,,,,
PRJ180538_VPH20T,,,,,SI-GA-G6_4,AAAGAAGA,,,,

Here we merge across the four indexes and all lanes. However, your library does not currently allow this.
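To make the intended merge concrete, here is a minimal sketch that groups the index sequences from the example above by Sample_ID. It uses only the Python standard library, not sample-sheet itself, and the helper name `indexes_by_sample` is made up for illustration.

```python
import csv
import io
from collections import defaultdict

# The [Data] rows from the example above, copied verbatim.
DATA = """\
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Index_Plate_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
PRJ180538_VPH20T,,,,,SI-GA-G6_1,CTGACGCG,,,,
PRJ180538_VPH20T,,,,,SI-GA-G6_2,GGTCGTAC,,,,
PRJ180538_VPH20T,,,,,SI-GA-G6_3,TCCTTCTT,,,,
PRJ180538_VPH20T,,,,,SI-GA-G6_4,AAAGAAGA,,,,
"""


def indexes_by_sample(data_section):
    """Map each Sample_ID to the list of index sequences that share it."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(data_section)):
        grouped[row["Sample_ID"]].append(row["index"])
    return dict(grouped)


merged = indexes_by_sample(DATA)
for sample_id, indexes in merged.items():
    print(sample_id, len(indexes), indexes)
```

Each key with more than one index is a sample whose reads we expect to land in a single set of FASTQ files.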

Could you allow duplicated Sample_IDs or recommend an alternative approach?

Thanks! Florian

clintval commented 6 years ago

Hi @reisingerf, this column must remain unique in the sample sheet per the Illumina specification. However, this library does allow repeated Sample_IDs, but only if the lanes differ (key Lane). The specification says:

"At a minimum, the one column that is universally required is Sample_ID, which provides a unique string identifier for each sample."

Why, in your example, are you not using the Lane key?

Would you be willing to give me more background on why you cannot give unique Sample_IDs to these samples?

Also, for curiosity's sake, do you have any references from 10X I can look at?

Here are the current validations performed when a sample is added to the sample sheet:

https://github.com/clintval/sample-sheet/blob/35a84d70f15e7a4433ebfe19beceecb0d39d765c/sample_sheet/_sample_sheet.py#L528-L539
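The rule described in this comment (a repeated Sample_ID is accepted only when the Lane differs) can be sketched as follows. This is an illustration of the rule as stated in the thread, not the code at the linked commit; `add_checked` and the plain-dict samples are hypothetical stand-ins.

```python
def add_checked(samples, sample):
    """Append `sample` unless its (Sample_ID, Lane) pair is already present.

    Hypothetical stand-in for the library's validation: samples are plain
    dicts here, and a missing Lane is treated as None.
    """
    key = (sample["Sample_ID"], sample.get("Lane"))
    for other in samples:
        if (other["Sample_ID"], other.get("Lane")) == key:
            raise ValueError(f"Duplicate Sample_ID/Lane combination: {key}")
    samples.append(sample)


# Same Sample_ID on different lanes passes the check.
with_lanes = []
add_checked(with_lanes, {"Sample_ID": "PRJ180538_VPH20T", "Lane": "1"})
add_checked(with_lanes, {"Sample_ID": "PRJ180538_VPH20T", "Lane": "2"})

# Same Sample_ID with no Lane column is rejected, as reported in this issue.
without_lanes = []
add_checked(without_lanes, {"Sample_ID": "PRJ180538_VPH20T", "index": "CTGACGCG"})
try:
    add_checked(without_lanes, {"Sample_ID": "PRJ180538_VPH20T", "index": "GGTCGTAC"})
    rejected = False
except ValueError:
    rejected = True
```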

reisingerf commented 6 years ago

I have seen the specs from Illumina, but they do not explicitly say that Sample_IDs can't be duplicated. They only say that a Sample_ID has to uniquely identify a sample. In our case we want to demultiplex several lanes and indexes into the same FASTQ file (Sample_ID). We are using 10X Genomics libraries: each sample is indexed with 4 indexes and run across 4 lanes. Our aim is to get a SINGLE FASTQ file per read per sample.

clintval commented 6 years ago

To pass validation you need to specify the column Lane for each entry in your Sample Sheet.

Does that conform with the tools you are using from 10X? If not, do send me a link.

clintval commented 6 years ago

So make it like this:

Sample_ID,Sample_Name,Lane,Sample_Plate,Sample_Well,Index_Plate_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
PRJ180538_VPH20T,,1,,,,SI-GA-G6_1,CTGACGCG,,,,
PRJ180538_VPH20T,,2,,,,SI-GA-G6_2,GGTCGTAC,,,,
PRJ180538_VPH20T,,3,,,,SI-GA-G6_3,TCCTTCTT,,,,
PRJ180538_VPH20T,,4,,,,SI-GA-G6_4,AAAGAAGA,,,,

reisingerf commented 6 years ago

I don’t think that’ll work (but I am not the expert on this). As far as I understand, the demultiplexing works across all lanes if no Lane column is given. If we were to use the Lane column, we’d have to use 16 lines (4 indexes over 4 lanes).
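The 16-line count comes from pairing every index with every lane. A minimal sketch of that expansion, assuming the lanes are numbered 1 to 4 and using only the columns that carry values in the example:

```python
from itertools import product

SAMPLE_ID = "PRJ180538_VPH20T"
# The four 10X indexes for this sample, taken from the example sheet.
INDEXES = [
    ("SI-GA-G6_1", "CTGACGCG"),
    ("SI-GA-G6_2", "GGTCGTAC"),
    ("SI-GA-G6_3", "TCCTTCTT"),
    ("SI-GA-G6_4", "AAAGAAGA"),
]
LANES = [1, 2, 3, 4]  # assumption: four lanes, numbered 1-4

# One row per (lane, index) combination.
rows = [
    {"Sample_ID": SAMPLE_ID, "Lane": lane, "I7_Index_ID": name, "index": seq}
    for lane, (name, seq) in product(LANES, INDEXES)
]

for row in rows:
    print(",".join(str(value) for value in row.values()))
```

Every (Sample_ID, Lane) pair still appears four times in this expansion, once per index, so adding a Lane column alone does not make the rows unique by Sample_ID and Lane.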

Am I missing something?

clintval commented 6 years ago

I think I follow (yes, you would have to specify all 16 lines under the Illumina spec). Which demultiplexer or pipeline are you using? Can you provide me with docs?

reisingerf commented 6 years ago

We are using bcl2fastq with the --no-lane-splitting option.

clintval commented 6 years ago

FYI, even the 10X demultiplexer requires specifying a Lane if you want to demultiplex per lane:

https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/mkfastq

To help you debug further, I will need two things:

  1. A copy of your sample sheet (feel free to censor)
  2. Exact invocation of your bcl2fastq command

Thanks! Hope I can help.

reisingerf commented 6 years ago

I will compile an example and get back to you. Thanks for your help!

reisingerf commented 6 years ago

Here is our SampleSheet.csv

[Header],,,,,,,,,,
IEMFileVersion,5,,,,,,,,,
Experiment Name,Tsqn180801,,,,,,,,,
Date,3/08/2018,,,,,,,,,
Workflow,GenerateFASTQ,,,,,,,,,
Application,NovaSeq FASTQ Only,,,,,,,,,
Instrument Type,NovaSeq,,,,,,,,,
Assay,TruSeq Nano DNA,,,,,,,,,
Index Adapters,IDT-ILMN TruSeq DNA UD Indexes (96 Indexes),,,,,,,,,
Description,Tsqn180801,,,,,,,,,
Chemistry,Amplicon,,,,,,,,,
,,,,,,,,,,
[Reads],,,,,,,,,,
151,,,,,,,,,,
151,,,,,,,,,,
,,,,,,,,,,
[Settings],,,,,,,,,,
Adapter,AGATCGGAAGAGCACACGTCTGAACTCCAGTCA,,,,,,,,,
AdapterRead2,AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT,,,,,,,,,
,,,,,,,,,,
[Data],,,,,,,,,,
Sample_ID,Sample_Name,Sample_Plate,Sample_Well,Index_Plate_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description
PRJ180538_VPH20T,,,,,SI-GA-G6_1,CTGACGCG,,,,
PRJ180538_VPH20T,,,,,SI-GA-G6_2,GGTCGTAC,,,,
PRJ180538_VPH20T,,,,,SI-GA-G6_3,TCCTTCTT,,,,
PRJ180538_VPH20T,,,,,SI-GA-G6_4,AAAGAAGA,,,,
PRJ180539_VCBPH5T,,,,,SI-GA-G7_1,GGTATGCA,,,,
PRJ180539_VCBPH5T,,,,,SI-GA-G7_2,CTCGAAAT,,,,
PRJ180539_VCBPH5T,,,,,SI-GA-G7_3,ACACCTTC,,,,
PRJ180539_VCBPH5T,,,,,SI-GA-G7_4,TAGTGCGG,,,,
PRJ180540_VCBP14T,,,,,SI-GA-G8_1,TATGAGCT,,,,
PRJ180540_VCBP14T,,,,,SI-GA-G8_2,CCGATAGC,,,,
PRJ180540_VCBP14T,,,,,SI-GA-G8_3,ATACCCAA,,,,
PRJ180540_VCBP14T,,,,,SI-GA-G8_4,GGCTGTTG,,,,
PRJ180541_VPH8T,,,,,SI-GA-G9_1,TAGGACGT,,,,
PRJ180541_VPH8T,,,,,SI-GA-G9_2,ATCCCACA,,,,
PRJ180541_VPH8T,,,,,SI-GA-G9_3,GGAATGTC,,,,
PRJ180541_VPH8T,,,,,SI-GA-G9_4,CCTTGTAG,,,,
PRJ180542_VPH23T,,,,,SI-GA-G10_1,TCGCCAGC,,,,
PRJ180542_VPH23T,,,,,SI-GA-G10_2,AATGTTAG,,,,
PRJ180542_VPH23T,,,,,SI-GA-G10_3,CGATAGCT,,,,
PRJ180542_VPH23T,,,,,SI-GA-G10_4,GTCAGCTA,,,,
PRJ180543_VPH36T,,,,,SI-GA-G11_1,TTATCGTT,,,,
PRJ180543_VPH36T,,,,,SI-GA-G11_2,AGCAGAGC,,,,
PRJ180543_VPH36T,,,,,SI-GA-G11_3,CATCTCCA,,,,
PRJ180543_VPH36T,,,,,SI-GA-G11_4,GCGGATAG,,,,
PRJ180544_PGL3,,,,,SI-GA-G12_1,ATTCTAAG,,,,
PRJ180544_PGL3,,,,,SI-GA-G12_2,CCCGATTA,,,,
PRJ180544_PGL3,,,,,SI-GA-G12_3,TGGAGGCT,,,,
PRJ180544_PGL3,,,,,SI-GA-G12_4,GAATCCGC,,,,
PRJ180545_LSI_noIAA,,,,,SI-GA-F3_1,TTCAGGTG,,,,
PRJ180545_LSI_noIAA,,,,,SI-GA-F3_2,ACGGACAT,,,,
PRJ180545_LSI_noIAA,,,,,SI-GA-F3_3,GATCTTGA,,,,
PRJ180545_LSI_noIAA,,,,,SI-GA-F3_4,CGATCACC,,,,
PRJ180546_LSI_IAA,,,,,SI-GA-F4_1,CCCAATAG,,,,
PRJ180546_LSI_IAA,,,,,SI-GA-F4_2,GTGTCGCT,,,,
PRJ180546_LSI_IAA,,,,,SI-GA-F4_3,AGAGTCGC,,,,
PRJ180546_LSI_IAA,,,,,SI-GA-F4_4,TATCGATA,,,,
PRJ180547_LL_30,,,,,SI-GA-F1_1,GTTGCAGC,,,,
PRJ180547_LL_30,,,,,SI-GA-F1_2,TGGAATTA,,,,
PRJ180547_LL_30,,,,,SI-GA-F1_3,CAATGGAG,,,,
PRJ180547_LL_30,,,,,SI-GA-F1_4,ACCCTCCT,,,,
PRJ180548_LL_38,,,,,SI-GA-F2_1,TTTACATG,,,,
PRJ180548_LL_38,,,,,SI-GA-F2_2,CGCGATAC,,,,
PRJ180548_LL_38,,,,,SI-GA-F2_3,ACGCGGGT,,,,
PRJ180548_LL_38,,,,,SI-GA-F2_4,GAATTCCA,,,,

And we use a command like this:

bcl2fastq -R /novaseq/runfolder --sample-sheet /novaseq/runfolder/SampleSheet.csv -o /bcl2fastq_output/runfolder --create-fastq-for-index-reads --minimum-trimmed-read-length=8 --mask-short-adapter-reads=8 --ignore-missing-positions --ignore-missing-controls --ignore-missing-filter --ignore-missing-bcls --no-lane-splitting

clintval commented 6 years ago

Thanks. Quick question. Does IEM allow you to make that sample sheet? Does bcl2fastq accept it?

clintval commented 6 years ago

Just installed IEM and confirmed this is an allowable sample sheet!

reisingerf commented 6 years ago

I haven't checked with IEM, but we used that sample sheet with bcl2fastq successfully. (If it didn't work for us, I would not have requested the change ;) )

clintval commented 6 years ago

Bug confirmed. Thanks @reisingerf, I misinterpreted the spec.

reisingerf commented 6 years ago

No worries! I initially was confused by the spec as well. I only realised due to our new requirement.

clintval commented 6 years ago

This library now allows adding equivalent samples, with a warning. See the docs for add_sample() here for a way to turn off that warning:

http://sample-sheet.readthedocs.io/sample_sheet.html#sample_sheet.SampleSheet.add_sample
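The linked docs describe the library's own mechanism, which is not reproduced here. As a generic illustration, Python's standard `warnings` module can silence a warning around a call. In the sketch below, `add_sample_sketch` is a hypothetical stand-in that warns when a Sample_ID is repeated; it is not the library's `add_sample()`, and the warning category is an assumption.

```python
import warnings


def add_sample_sketch(samples, sample):
    """Hypothetical stand-in: warn, rather than fail, on a repeated Sample_ID."""
    if any(other["Sample_ID"] == sample["Sample_ID"] for other in samples):
        warnings.warn(
            f"Equivalent sample already present: {sample['Sample_ID']}",
            UserWarning,
        )
    samples.append(sample)


samples = []
add_sample_sketch(samples, {"Sample_ID": "PRJ180538_VPH20T", "index": "CTGACGCG"})

# Default behaviour: the second row with the same Sample_ID triggers a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    add_sample_sketch(samples, {"Sample_ID": "PRJ180538_VPH20T", "index": "GGTCGTAC"})

# Silenced: the same situation inside an "ignore" filter records nothing.
with warnings.catch_warnings(record=True) as silenced:
    warnings.simplefilter("ignore")
    add_sample_sketch(samples, {"Sample_ID": "PRJ180538_VPH20T", "index": "TCCTTCTT"})

print(len(caught), len(silenced), len(samples))
```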

FYI, doing more research on this is on my radar.