clintval / sample-sheet

Parse Illumina sample sheets with Python
https://sample-sheet.rtfd.io
MIT License

validator #105

Open colindaven opened 4 years ago

colindaven commented 4 years ago

Hi,

thanks for this.

I have a lot of problems with incorrect and weird sample sheets from the lab, generated by copy-paste and with strange barcode schemes, such as mixed TruSeq and Nextera.

I was starting to write a very simple validator to pick up on the worst errors, but now see you have done much more.

Are you planning to write a standalone validator or is this already possible via your library?

Thanks, Colin

clintval commented 4 years ago

I certainly could! I want to make a non-short-circuiting validation refactor but have not found the time: https://github.com/clintval/sample-sheet/issues/96

Right now, if anything looks "wrong", an exception is raised immediately. If validation instead compiled all errors and emitted them together at the end, you could fix your sample sheet in one pass instead of iteratively.
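
To make the difference concrete, here is a minimal sketch, not the library's actual code. `validate_all` and both check functions are hypothetical names, and the assumption is that each check raises `ValueError` when it finds a problem. Instead of letting the first exception propagate, the loop catches each one so they can be reported together:

```python
# Hypothetical checks; none of these names exist in sample-sheet.
def check_no_semicolons(lines):
    for number, line in enumerate(lines, start=1):
        if ";" in line:
            raise ValueError(f"line {number}: semicolon found, expected commas")


def check_has_data_section(lines):
    if not any(line.startswith("[Data]") for line in lines):
        raise ValueError("missing [Data] section")


def validate_all(lines, checks):
    """Run every check and collect failures instead of stopping at the first."""
    errors = []
    for check in checks:
        try:
            check(lines)
        except ValueError as error:
            errors.append(str(error))
    return errors


lines = ["[Header]", "Date;2020-01-01", "Sample_ID,Sample_Name"]
errors = validate_all(lines, [check_no_semicolons, check_has_data_section])
for message in errors:
    print(message)
```

With the sample input above, both problems are reported in a single run, whereas a fail-fast version would stop at the semicolon.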

What are some validations you would like to see included beyond simple non-conformance with the spec?

clintval commented 4 years ago

It would also be awesome if it supported plugin validations so you could easily extend base validations with custom lab validations.
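
One way this could look, purely as a sketch: a registry that base and lab-specific checks both add themselves to. Everything here (`VALIDATORS`, the `validator` decorator, `run_validators`, and the two example rules) is hypothetical; sample-sheet has no such plugin API today, and the rows are plain dicts rather than the library's own objects.

```python
# All names below are hypothetical; this is not an existing sample-sheet API.
VALIDATORS = []


def validator(func):
    """Register a function that takes rows and returns a list of messages."""
    VALIDATORS.append(func)
    return func


@validator
def sample_ids_are_unique(rows):
    seen, messages = set(), []
    for row in rows:
        sample_id = row["Sample_ID"]
        if sample_id in seen:
            messages.append(f"duplicate Sample_ID: {sample_id}")
        seen.add(sample_id)
    return messages


# A lab-specific rule, added without touching the base validations.
@validator
def project_is_set(rows):
    return [
        f"{row['Sample_ID']}: Sample_Project is empty"
        for row in rows
        if not row.get("Sample_Project")
    ]


def run_validators(rows):
    """Run every registered validator and return all messages together."""
    messages = []
    for check in VALIDATORS:
        messages.extend(check(rows))
    return messages


rows = [
    {"Sample_ID": "S1", "Sample_Project": "P1"},
    {"Sample_ID": "S1", "Sample_Project": ""},
]
print(run_validators(rows))
```

Because each validator returns messages instead of raising, the registry also fits the non-short-circuiting design: adding a rule never changes when the others run.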

clintval commented 4 years ago

Hmm, I could use this in our lab too. I will do my best to carve out some time.

colindaven commented 4 years ago

Nice. I had just started this, but these are the most common issues I see:

Like I said, I haven't gotten far at all, so this is very preliminary, I'm afraid. The idea is that lab colleagues can run a simple .bat script, which runs a Python script natively on Windows. Output comments and errors go to an output txt file. That way they get instant feedback without having to wait for a) data and b) a bioinformatician to run bcl2fastq.
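
Roughly, the driver around the checks could look like the sketch below. The names `check_line` and `write_report`, and the file names in the comment, are placeholders for illustration and not the actual script; only a semicolon check is included to keep it short.

```python
import sys
from pathlib import Path


def check_line(line):
    """Return comments for one line (only the semicolon check is shown here)."""
    if ";" in line:
        return ["ERROR: found a semicolon; only commas are valid delimiters"]
    return []


def write_report(sheet_path, report_path):
    """Check every line of the sample sheet and write all comments to a text file."""
    comments = []
    with open(sheet_path, newline="") as handle:
        for number, line in enumerate(handle, start=1):
            for comment in check_line(line):
                comments.append(f"line {number}: {comment}")
    if not comments:
        comments.append("No problems found")
    Path(report_path).write_text("\n".join(comments) + "\n")
    return comments


if __name__ == "__main__" and len(sys.argv) == 3:
    # e.g. called from a .bat file: python validate.py SampleSheet.csv report.txt
    write_report(sys.argv[1], sys.argv[2])
```

The .bat file then only needs a single line that calls Python with the sample sheet and the report path, so nobody in the lab has to open a terminal.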


# Shared state: comments collected for the report, and a count of "Index Adapters" lines.
outputComments = []
indexAdaptersLineCount = 0

def checkIndexAdaptersLine(inputLine):
    # The counter is module-level, so declare it global before assigning to it.
    global indexAdaptersLineCount
    if "Index Adapters" in inputLine:
        indexAdaptersLineCount += 1
        if '"Index Adapters,""TruSeq DNA CD Indexes (96 Indexes)"""' in inputLine:
            outputComments.append("Index Adapters line looks good for TruSeq\n")
        elif '"Index Adapters,""IDT-ILMN Nextera DNA UD Indexes Set A"""' in inputLine:
            outputComments.append("Index Adapters line looks good for Nextera\n")
        else:
            outputComments.append("INFO: Could not read Index Adapters line properly\n")
            outputComments.append('INFO: Typically should be Truseq: "Index Adapters,""TruSeq DNA CD Indexes (96 Indexes)"""\n')
            outputComments.append('INFO: Typically should be Nextera "Index Adapters,""IDT-ILMN Nextera DNA UD Indexes Set A"""\n')

def checkSemicolons(inputLine):
    # Semicolons usually mean the sheet was saved with a non-comma delimiter.
    if ";" in inputLine:
        outputComments.append("\n\n####### ERROR !!! FOUND A SEMICOLON; SHOULD ONLY CONTAIN COMMA AS DELIMITERS !! ####### \n ")

def checkHeader(inputLine):
    # A header with a duplicated Sample_Name column may contain no "Sample_ID" at all,
    # so this test must not be nested under the Sample_ID check.
    if "Sample_Name,Sample_Name" in inputLine:
        outputComments.append("Error: Sample_Name,Sample_Name should be Sample_ID,Sample_Name\n")
    if "Sample_ID" in inputLine:
        if "Sample_ID,Sample_Name,Sample_Plate,Sample_Well,I7_Index_ID,index,I5_Index_ID,index2,Sample_Project,Description,,,," in inputLine:
            outputComments.append("Header looks ok\n")
            outputComments.append("INFO: Header: " + inputLine + "\n")

golharam commented 4 years ago

Do they use the Illumina experiment manager to create a sample sheet? Our lab folks do and it helps cut down errors.

colindaven commented 4 years ago

Nope, since they say it doesn't allow custom primers, which we use a lot of, e.g. for amplicons, Nextera, NEB etc.

clintval commented 4 years ago

@colindaven I like where you are headed:

"Output comments and errors go to an output txt file."

Right now validation is fail-fast, which really hurts the turnaround for making a valid sample sheet, since you have to edit and re-parse the sample sheet repeatedly to wade through each validation exception one by one. I agree we should refactor validation in this toolkit so it is modular and as lazy as possible (collect all exceptions, and then emit them in bulk at the end of a validation call).

We're only on 0.11.0 so this is something I'm inclined to bundle into a v1 refactor and final public API.