AndersenLab / annotation-nf

Annotate VCF with snpeff and bcsq
1 stars 1 forks source link

Error generating CE flat file - Empty Divergent Region Bed #8

Open mckeowr1 opened 1 year ago

mckeowr1 commented 1 year ago

The nf pipeline terminated during the make_flat_file process. The Rscript failed and threw this erorr

  Joining, by = c("CHROM", "POS", "REF", "ALT", "ANNOTATION")
  Warning message:
  Expected 7 pieces. Missing pieces filled with `NA` in 427215 rows [3, 4, 5, 6, 67, 72, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 199, 200, 201, 356, ...]. 
  Error in `[.data.frame`(x, i, j, drop) : undefined columns selected
  Calls: %>% ... <Anonymous> -> fuzzy_join -> [ -> [.data.table -> [.data.frame
  In addition: Warning message:
  In data.table::fread(args[4], col.names = c("chrom", "start", "end")) :
    File 'div.bed' has size 0. Returning a NULL data.table.
  Execution halted

It looks like its happening here when we add the divergent region status for CE. This sense that this failure is CE specific.

if(args[5] == "c_elegans") {
    divergent <- data.table::fread(args[4], 
                                   col.names = c("chrom", "start", "end"))

    #Condense Overlapping Regions (If portion of regions shared by >1 strian
    condensed <- divergent %>%
      dplyr::mutate(DIVERGENT = "D") # Add marker to Divergent Region

    #Join the data - if a position is within a divergent region Divergent tag is added

    test <- table_ready %>%
      dplyr::mutate(start = POS, end = POS)
    join <- fuzzyjoin::genome_join(test, condensed, by = c(
      "CHROM" = "chrom", 
      "start" = "start", 
      "end" = "end"), 
      mode = "left") %>%
      dplyr::select(-start.x, -start.y, -end.x, -end.y, -chrom)

    # join <- fuzzyjoin::genome_join(table_ready, condensed,  by = c(
    #   "CHROM" = "chrom",
    #   "POS" = "start",
    #   "POS" = "end"),
    #   mode = "left") %>%
    #   select(-chrom, -start, -end)
} else {
    join <- table_ready %>%
        dplyr::mutate(DIVERGENT = NA)
}

It seems that we are just tring to join a NULL data.table (whatever that is) with our annotation and that just doens't work. This is probably a problem with the channel-inputs.

Input tuple for make_flat_file:

tuple file(bcsq_samples_parsed), file(div_file), file(bcsq_scores), file(wbgene_names)

Current channel wrangling

bcsq_parse_samples.out
.combine(bcsq_parse_scores.out)
.combine(Channel.fromPath("${reference_dir}/csq/${params.species}.gff")) | make_flat_file

output for bcsq_parse_samples

output:
    tuple file("BCSQ_samples_parsed.tsv"), file("div.bed")

output for bcsq_parse_scores

output:
    path "BCSQ_scores_parsed.tsv"

Our current input tuple should be:

file("BCSQ_samples_parsed.tsv"), file("div.bed") path "BCSQ_scores_parsed.tsv" Channel.fromPath("${reference_dir}/csq/${params.species}.gff"

So it looks like the input cardinality matches.

The other potential source of the issue is that we are not loading the divergent region file. Lets check if it was the correct file in the bcsq_parse_samples process....

Its also blank. Well thats because I didn't specify one. This is a bug that we should fix. The make_flat_file script automatically tries to join the divergent region bed, even if the file is blank when the species ID is C. elegans. We should just have the script check if it can load the data or not. If not then print an erorr message and append NA for the divergent column.