NorwegianVeterinaryInstitute / DemultiplexRawSequenceData

A workflow automation script: demultiplex the library sequence, run quality checks, deliver to archiving and processing afterwards
GNU General Public License v3.0

feature request: SampleSheet.csv validation #21

Open georgemarselis-nvi opened 2 years ago

georgemarselis-nvi commented 2 years ago

Regardless of LIMS or not, there can always be typos.

It will pay off if we devote some time to either incorporating an existing SampleSheet.csv validator or building one ourselves, in order to preemptively detect semantic errors or typos. The pay-off will be a more automated and seamless run/demultiplexing.

According to @CathrineAB, the common errors found in SampleSheet.csv are:

  1. [ ] Space in sample name or project name. These are especially hard to see when they occur at the end of the name. I replace the space with a “-” if it is in the middle of the name, and erase it if it is at the end.
  2. [ ] Æ, Ø or Å in sample names or project names.
  3. [ ] Extra lines in SampleSheet.csv with no sample info in them. Each empty line appears as a bunch of commas. They need to be deleted or demuxing fails.
  4. [ ] Forgetting to add the extra column called “Analysis” and set an “x” in that column for all samples (I don’t know if we will keep this feature in the future).
  5. [ ] '.' in sample names.
  6. [ ] My own note: check that each line has the expected number of commas (e.g. there are too many commas between ‘A1’ and ‘RRBS-NMBU’).
  7. [ ] Check for missing commas: use a state machine and report if the comma is missing when transitioning from state N to state N+1 (e.g. a comma was missing between ‘Sample1’ and ‘LPRSSBASNMBU1’).
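A rough sketch of how the checks in the list above could look in Python. This is only an illustration, not code from this repository: the function name `validate_data_section`, the column names in `NAME_COLUMNS` (borrowed from the usual Illumina `[Data]` layout) and the message wording are all assumptions. It also assumes the caller has already extracted the header row and the sample rows, and that no field contains a quoted newline.

```python
import csv
import re

# Assumption: hypothetical column names, adjust to the local SampleSheet.csv.
NAME_COLUMNS = ("Sample_ID", "Sample_Name", "Sample_Project")
NORDIC_LETTERS = re.compile(r"[ÆØÅæøå]")


def validate_data_section(lines):
    """Return a list of (line_number, message) tuples.

    `lines` holds the header row followed by the sample rows as raw strings.
    Line numbers are 1-based and relative to `lines`.
    """
    if not lines:
        return [(0, "no header row found")]

    problems = []
    header = next(csv.reader([lines[0]]), [])
    if "Analysis" not in header:
        problems.append((1, "missing 'Analysis' column"))

    for number, raw in enumerate(lines[1:], start=2):
        row = next(csv.reader([raw]), [])

        # Item 3: rows that are blank or consist only of commas.
        if not any(cell.strip() for cell in row):
            problems.append((number, "empty row"))
            continue

        # Items 6 and 7: too many or too few commas change the field count.
        if len(row) != len(header):
            problems.append(
                (number, f"expected {len(header)} fields, found {len(row)}")
            )
            continue

        record = dict(zip(header, row))
        for column in NAME_COLUMNS:
            value = record.get(column, "")
            if any(ch.isspace() for ch in value):      # item 1
                problems.append((number, f"whitespace in {column}"))
            if NORDIC_LETTERS.search(value):           # item 2
                problems.append((number, f"Æ/Ø/Å in {column}"))
            if "." in value:                           # item 5
                problems.append((number, f"'.' in {column}"))

        # Item 4: every sample should carry an "x" in the Analysis column.
        if "Analysis" in record and record["Analysis"].strip().lower() != "x":
            problems.append((number, "'Analysis' is not set to 'x'"))

    return problems
```

Comparing the field count against the header is a simpler stand-in for the state machine in item 7: a missing comma merges two fields, so the row comes up one field short. It cannot say which comma is missing, only that the row is malformed.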

arvindsundaram commented 2 years ago

Some of these errors will also show up in the SampleSheet.csv produced by the LIMS. It is better to pass the CSV file through a clean-up script before demultiplexing.
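A minimal sketch of such a clean-up pass, assuming it only needs to automate the two manual fixes described earlier in the thread: spaces in names are replaced with "-" (or dropped at the ends), and empty rows are removed. The function name `clean_data_section` and the column names are hypothetical. Æ/Ø/Å and '.' are left untouched here, because the thread does not say what they should be replaced with.

```python
import csv
import io

# Assumption: hypothetical column names, adjust to the local SampleSheet.csv.
NAME_COLUMNS = ("Sample_ID", "Sample_Name", "Sample_Project")


def clean_data_section(text, name_columns=NAME_COLUMNS):
    """Return cleaned CSV text for the header row plus the sample rows."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ""

    header = [cell.strip() for cell in rows[0]]
    # Indexes of the columns whose values should be normalised.
    targets = {i for i, name in enumerate(header) if name in name_columns}

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for row in rows[1:]:
        # Drop rows that are blank or contain only commas.
        if not any(cell.strip() for cell in row):
            continue
        # split() discards leading/trailing whitespace; inner runs become "-".
        cleaned = [
            "-".join(cell.split()) if i in targets else cell
            for i, cell in enumerate(row)
        ]
        writer.writerow(cleaned)
    return out.getvalue()
```

Rows with the wrong number of commas are passed through unchanged, since guessing where a comma belongs is risky; those are probably better reported to a human than repaired automatically.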