flow-r / ultraseq


Auto-detecting new Illumina fastq format, and strict checking. #7

Open sahilseth opened 8 years ago

sahilseth commented 8 years ago

To enforce strict_format_checking use the following:

create_fq_sheet(path, strict_format_checking = TRUE)

If some file names do not follow the expected format, the function either stops with an error (TRUE) or continues with a warning (FALSE).
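
For anyone wondering how the switch behaves, here is a minimal sketch of the idea. It is not the ultraseq implementation: `check_fastq_names()` and the regular expression `pat` are hypothetical names made up for this example, and the messages are only modelled on the output shown in this issue.

```r
# Hypothetical helper, not part of ultraseq: flag file names that do not
# match `pattern`, then either stop (strict) or carry on with warnings.
check_fastq_names <- function(files, pattern, strict_format_checking = FALSE) {
  files <- basename(files)
  ok <- grepl(pattern, files)

  # One warning per file name that could not be parsed
  for (f in files[!ok]) {
    warning("there was an issue parsing this filename: ", f, call. = FALSE)
  }

  # Strict mode: any bad name is fatal
  if (strict_format_checking && any(!ok)) {
    stop("Some file names do not have the correct format, please check.",
         call. = FALSE)
  }

  ok
}

# Assumed pattern for the tail of a CASAVA 1.8 style name (illustration only)
pat <- "_(NoIndex|[ACGTN]+)_L[0-9]{3}_R[12]_[0-9]{3}\\.fastq\\.gz$"
files <- c("ABCD_00647_NoIndex_L001_R1_001.fastq.gz",
           "ABCD-ABC_S1_L001_R1_001.fastq.gz")

# Non-strict: returns a logical vector and emits a warning for the second file
ok <- check_fastq_names(files, pat)

# Strict: the same input raises an error, which the caller can catch
res <- tryCatch(
  check_fastq_names(files, pat, strict_format_checking = TRUE),
  error = function(e) conditionMessage(e)
)
```

Keeping the warnings in both modes means the error in strict mode still tells you which files were at fault.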

strict checking OFF:

Using CASAVA 1.8 naming format
> split_names_fastq2(fq1820, fmt1820)
                                     samplename   index lane read  num                                                      file
1                                    ABCD_00647 NoIndex    1    1  001                   ABCD_00647_NoIndex_L001_R1_001.fastq.gz
2                                    ABCD_00835  TGACCA    1    1  004                    ABCD_00835_TGACCA_L001_R1_004.fastq.gz
3                 ABCD-FC112-MS11-Cap854-3-ID09  GATCAG    8    1  001 ABCD-FC112-MS11-Cap854-3-ID09_GATCAG_L008_R1_001.fastq.gz
4           ABCD_00914_S19_L008_R1_001.fastq.gz    <NA> <NA> <NA> <NA>                       ABCD_00914_S19_L008_R1_001.fastq.gz
5 AB-8-20m-DOX-serbp-1_S43_L008_R1_001.fastq.gz    <NA> <NA> <NA> <NA>             AB-8-20m-DOX-serbp-1_S43_L008_R1_001.fastq.gz
6              ABCD-ABC_S1_L001_R1_001.fastq.gz    <NA> <NA> <NA> <NA>                          ABCD-ABC_S1_L001_R1_001.fastq.gz
Warning messages:
1: In FUN(X[[i]], ...) :
  there was a issue parsing this filename: ABCD_00914_S19_L008_R1_001.fastq.gz
2: In FUN(X[[i]], ...) :
  there was a issue parsing this filename: AB-8-20m-DOX-serbp-1_S43_L008_R1_001.fastq.gz
3: In FUN(X[[i]], ...) :
  there was a issue parsing this filename: ABCD-ABC_S1_L001_R1_001.fastq.gz
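
The three names that failed above carry an `_S<number>_` field where CASAVA 1.8 names carry the index, which is the newer Illumina naming style. A rough sketch of how the two styles could be told apart is below. Both regular expressions and `detect_fastq_format()` are assumptions written for this example; the package's own `fmt1820` may well differ, and dual-index names are not handled.

```r
# Assumed patterns, for illustration only (not taken from ultraseq).
# CASAVA 1.8:  <samplename>_<index>_L<lane>_R<read>_<num>.fastq.gz
re_casava18 <- "^(.+)_([ACGTN]+|NoIndex)_L([0-9]{3})_R([12])_([0-9]{3})\\.fastq\\.gz$"
# Newer style: <samplename>_S<sample number>_L<lane>_R<read>_<num>.fastq.gz
re_newer <- "^(.+)_S([0-9]+)_L([0-9]{3})_R([12])_([0-9]{3})\\.fastq\\.gz$"

# Hypothetical helper: return the naming style of one file, or NA if unknown
detect_fastq_format <- function(file) {
  f <- basename(file)
  if (grepl(re_newer, f)) {
    "newer"
  } else if (grepl(re_casava18, f)) {
    "casava_1.8"
  } else {
    NA_character_
  }
}

fqs <- c("ABCD_00647_NoIndex_L001_R1_001.fastq.gz",
         "ABCD_00835_TGACCA_L001_R1_004.fastq.gz",
         "ABCD-FC112-MS11-Cap854-3-ID09_GATCAG_L008_R1_001.fastq.gz",
         "ABCD_00914_S19_L008_R1_001.fastq.gz",
         "AB-8-20m-DOX-serbp-1_S43_L008_R1_001.fastq.gz",
         "ABCD-ABC_S1_L001_R1_001.fastq.gz")

# One label per file name
formats <- vapply(fqs, detect_fastq_format, character(1), USE.NAMES = FALSE)
```

With these patterns the first three names are labelled `casava_1.8` and the last three `newer`. The newer pattern is tested first only as a matter of ordering; neither pattern matches the other style's names here.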

strict checking ON:

> split_names_fastq2(fq1820, fmt1820, strict_format_checking = TRUE)
Error in split_names_fastq2(fq1820, fmt1820, strict_format_checking = TRUE) : 
  Some file names do not have the correct format, please check.
Refer to the warning message provided below.
In addition: Warning messages:
1: In FUN(X[[i]], ...) :
  there was a issue parsing this filename: ABCD_00914_S19_L008_R1_001.fastq.gz
2: In FUN(X[[i]], ...) :
  there was a issue parsing this filename: AB-8-20m-DOX-serbp-1_S43_L008_R1_001.fastq.gz
3: In FUN(X[[i]], ...) :
  there was a issue parsing this filename: ABCD-ABC_S1_L001_R1_001.fastq.gz

As of now, the default for strict_format_checking is FALSE. I may not switch the default, since some of my downstream code depends on it.