LiuzLab / TraceQC

Test with full FASTQ files. #12

Closed hyunhwan-jeong closed 4 years ago

hyunhwan-jeong commented 4 years ago

After today's discussion, I was able to locate where the data comes from. The data source is described in this paper: https://www.nature.com/articles/nmeth.4108

The FASTQ files also contain the SRA ID for each file, so I think I have the information I need.

Originally posted by @hyunhwaj in https://github.com/LiuzLab/TraceQC/issues/5#issuecomment-636400960

I am going to test TraceQC with the files.

hyunhwan-jeong commented 4 years ago

With the following script, I was able to run TraceQC v.0.1.0.

library(TraceQC)
library(fastqcr)

inp_file <- "~/TraceQC_test/inst/fastq/SRR4842510.fastq"
ref_file <- system.file("extdata", "test_data", "ref",
                        "ref.txt", package="TraceQC")

qc_dir <- "~/TraceQC_test/inst/fastqc"
fastqc("~/TraceQC_test/inst/fastq/",
       qc.dir=qc_dir)

input_qc_path <- get_qcpath(inp_file, qc_dir)

obj <- TraceQC(input_file = inp_file,
               ref_file = ref_file,
               fastqc_file = input_qc_path,
               ncores = 8)

generate_qc_report(
  input_file = normalizePath(inp_file),
  ref_file = normalizePath(ref_file),
  fastqc_dir = normalizePath(qc_dir),
  output_path = normalizePath("~/TraceQC_test/inst/output/SRR4842510.html"),
  ncores = 8,
  title = "TraceQC report about SRR4842510",
  preview = FALSE
)

Relative paths caused errors, so I wrapped the paths in normalizePath for the test run. That worked, and I have since revised the code to avoid the problem. The run also produced two warnings.

Warning messages:
1: In normalizePath("~/TraceQC_test/inst/output/SRR4842510.html") :
  path[1]="/mnt/data/hwan/TraceQC_test/inst/output/SRR4842510.html": No such file or directory
2: Removed 455 row(s) containing missing values (geom_path).

I will figure out how to resolve the normalizePath issue.
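For reference, the first warning appears because normalizePath is called on an output file that does not exist yet. One possible workaround, sketched below, is to normalize only the parent directory and then append the file name. The helper name abs_output_path is hypothetical and is not part of TraceQC; the example writes under tempdir() rather than the real output directory. Per the base R documentation, passing mustWork = FALSE to normalizePath is another way to suppress the warning.

```r
# Hypothetical helper (an assumption for illustration, not TraceQC code):
# build an absolute path for a file that may not exist yet by normalizing
# only its parent directory.
abs_output_path <- function(path) {
  out_dir <- dirname(path)
  # Create the output directory if needed so normalizePath() can resolve it.
  if (!dir.exists(out_dir)) dir.create(out_dir, recursive = TRUE)
  file.path(normalizePath(out_dir, winslash = "/", mustWork = TRUE),
            basename(path))
}

# The report file itself does not exist yet, but no warning is raised.
out <- abs_output_path(file.path(tempdir(), "output", "SRR4842510.html"))
print(out)
```

The returned path could then be passed as output_path to generate_qc_report in place of the normalizePath call on the file itself.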

hyunhwan-jeong commented 4 years ago

The path problem has been resolved with a new function (committed at c0178fb23d6658e7b90ad5a13efb72758a893959).

https://github.com/LiuzLab/TraceQC/blob/fc0bc86afcb9e4d2b879834c6ade2075f467c96b/R/util.R#L19-L24

hyunhwan-jeong commented 4 years ago

After adding the parallelization code, it is now way faster than before.

Processing a single file took 10 minutes with 16 cores, whereas the previous version took about two hours (roughly a 12-fold speedup).
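As a rough illustration of this kind of per-read parallelization, here is a minimal sketch using mclapply from R's parallel package. It is not the actual TraceQC implementation: count_mismatches, the toy reads, and the reference string are all made up for the example, and the real per-read work (alignment against the reference) is far more expensive.

```r
library(parallel)

# Toy stand-in for an expensive per-read step (hypothetical, not TraceQC code):
# count positions where a read differs from an equal-length reference.
count_mismatches <- function(read, ref) {
  sum(strsplit(read, "")[[1]] != strsplit(ref, "")[[1]])
}

ref   <- "ACGTACGTAC"
reads <- c("ACGTACGTAC", "ACGTTCGTAC", "TCGTACGTAA", "ACGAACGAAC")

# mclapply() forks worker processes, which is not available on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each read is processed independently, so the work splits cleanly across cores.
mismatches <- unlist(mclapply(reads, count_mismatches, ref = ref,
                              mc.cores = n_cores))
print(mismatches)
```

Because the reads are independent of each other, the speedup can approach the number of cores, minus the overhead of forking and collecting results.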