fpbarthel / GLASS

GLASS consortium
MIT License

Duplicate RGID causing pipeline to crash #42

Closed · fpbarthel closed this 6 years ago

fpbarthel commented 6 years ago

See here (line 70-75). There are duplicate readgroups HF22.1 and HF22.2 for sample GLSS-MD-0008-NB-DQBGTB. The readgroup PUs are different, though. I wonder what we should do about it. Changing the readgroup IDs would be sufficient to fix this problem, but we would have to deviate from our convention here. Any thoughts?

Kcjohnson commented 6 years ago

I don't think there is a systematic, non-disruptive way around this issue. Should we just reassign the RGID by shifting it down a character? Make it so that they are F22L.1 and F22G.1 instead of both being HF22.1?
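
Roughly like this (just a sketch; the flowcell barcodes below are made up, and I'm assuming the RGID is built from the first four characters of the flowcell barcode plus a per-sample readgroup counter):

  ## hypothetical flowcell barcodes that collide on their first four characters
  flowcells <- c("HF22LADXX", "HF22GADXX")
  counter   <- c(1, 1)
  ## current convention: first four characters -> both readgroups become "HF22.1"
  paste0(substr(flowcells, 1, 4), ".", counter)
  ## shifted by one character: characters 2-5 -> "F22L.1" and "F22G.1"
  paste0(substr(flowcells, 2, 5), ".", counter)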

fpbarthel commented 6 years ago

Yeah, let's do that. Why did we pick the first four characters again, and not, e.g., the first five or even the entire length?

Kcjohnson commented 6 years ago

I think it had to do with how the TCGA/GDC structured their data. We were trying to minimize the differences between the metadata.

fpbarthel commented 6 years ago

Yeah, I think so. We should consider putting together an ideal set of JSONs with all the information we have now, but let's do that after we reach data freeze.

sbamin commented 6 years ago

I would go with the full length to avoid ambiguity. Say, in the future, someone comes up with a tool (it might exist already!) to query RGID against the full-length sequence ID of each read; that would be easier if we have a full-length, regex-compliant ID.
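
For example (a hypothetical helper, not something in the current workflow), build the RGID from the complete flowcell barcode plus lane and sanitize it so it stays regex- and filename-safe:

  ## hypothetical helper: full-length RGID from the complete flowcell barcode plus lane,
  ## with any character outside [A-Za-z0-9._-] replaced so the ID stays regex-safe
  make_rgid <- function(flowcell, lane) {
    gsub("[^A-Za-z0-9._-]", "_", sprintf("%s.%s", flowcell, lane))
  }
  make_rgid("HF22LADXX", 1)   # "HF22LADXX.1"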

I find the sample schema validator for Snakemake useful for validating sample sheets. For CGP, I have a barebones schema that does validation. This site has a nice tutorial on schema options. For now, I guess it's low priority on my end, but I will implement it for the GLASS Singularity-based workflows to get strict checks on both sample and config files.

If a schema is not something we need, the following is the kind of check I used for a canine WGS/WES flowr-based workflow.

  ## fqs_map (the parsed fastq map, one row per readgroup) and dtime (run date
  ## for the @RG DT field) are expected to be defined upstream
  library(dplyr)

  fqs_map_report <- fqs_map %>% 
    ## keep original paths, then overwrite FQ1/FQ2 with file.access() codes (0 = readable)
    mutate(FQ1PATH = FQ1, FQ2PATH = FQ2) %>% 
    mutate_at(vars(FQ1, FQ2), funs(file.access), mode = 4) %>%
    mutate(FQ1CHK = ifelse(FQ1 == 0, "OK", "ERROR-FQ1_NOT_EXIST_OR_READABLE"), 
           FQ2CHK = ifelse(FQ2 == 0, "OK", "ERROR-FQ2_NOT_EXIST_OR_READABLE")) %>%
    mutate(FQ12CHK = ifelse(FQ1PATH != FQ2PATH, "OK", "ERROR-IDENTICAL_PE_FQ_FILE_PATH")) %>% 
    mutate(LIBCHK = ifelse(grepl("ILLUMINA", toupper(RGPL)), "OK", "WARN-NON_ILLUMINA_LIB")) %>%
    mutate(RGIDCHK = ifelse(duplicated(RGID), "ERROR-DUPLICATE_FOUND", "OK")) %>%
    ## all rows must belong to a single sample
    mutate(RGSMCHK = ifelse(length(unique(RGSM)) == 1, "OK", "ERROR-MORE_THAN_ONE_SAMPLE_FOUND")) %>%
    mutate(RGDT = rep(dtime, nrow(fqs_map)),
           ## @RG string as passed to the aligner, e.g. bwa mem -R
           rgstr = sprintf("\"@RG\\tID:%s\\tPL:%s\\tPU:%s\\tLB:%s\\tPI:0\\tDT:%s\\tSM:%s\\tCN:%s\"", 
                           RGID, RGPL, RGPU, RGLB, RGDT, RGSM, RGCN),
           outtag = paste0(RGSM, '-', RGID)) %>%
    mutate(rgstrchk = ifelse(duplicated(rgstr), "ERROR-DUPLICATE_FOUND", "OK"),
           outtagchk = ifelse(duplicated(outtag), "ERROR-DUPLICATE_FOUND", "OK")) %>%
    select(RGSM, RGSMCHK, FQ1PATH, FQ1CHK, FQ2PATH, FQ2CHK, FQ12CHK, RGPL, LIBCHK, RGID, 
           RGIDCHK, RGDT, rgstr, rgstrchk, outtag, outtagchk)

  ## output markdown-formatted table of checklists
  ## (results = "asis" is a knitr chunk option, not a kable() argument)
  knitr::kable(fqs_map_report, format = "markdown")

  ## Quit with error if fqs_map_file does not pass checklists
  if (any(grepl("ERROR", unlist(fqs_map_report))))
    stop("Error was reported while reading fastq map file.\n",
         "One or more checks failed with respect to fastq file path, paired-end fastq names, RGID tag, etc.\n",
         "See log file for more.\n")
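
For context, the snippet above expects fqs_map and dtime to exist already; a minimal, purely illustrative setup (the file path and date format are made up) could be:

  ## illustrative only: read the fastq map (one row per readgroup) and set the run date
  library(readr)
  fqs_map <- read_tsv("fqs_map.tsv", col_types = cols(.default = "c"))
  ## expected columns: FQ1, FQ2, RGID, RGSM, RGPL, RGPU, RGLB, RGCN
  dtime <- format(Sys.time(), "%Y-%m-%dT%H:%M:%S")   # date stamped into the @RG DT field
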
Kcjohnson commented 6 years ago

These checks look like they would be helpful. Definitely a good idea to incorporate them as we near the data freeze.

fpbarthel commented 6 years ago

Temporarily fixed