VanLoo-lab / ascat

ASCAT R package
https://www.mdanderson.org/research/departments-labs-institutes/labs/van-loo-laboratory/resources.html#ASCAT
162 stars 85 forks source link

ascat.prepareHTS ERROR in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 4 elements #172

Closed tanayb001 closed 6 months ago

tanayb001 commented 6 months ago

Hi

I got the following error while running ascat.prepareHTS :

ascat.prepareHTS( tumourseqfile = "Tumour.bam", normalseqfile = "Normal.bam", tumourname = "Tumour_name", normalname = "Normal_name", allelecounter_exe = "/usr/bin/alleleCounter", alleles.prefix = "G1000_alleles_hg38_chr", loci.prefix = "G1000_loci_hg38_chr", gender = "XY", genomeVersion = "hg38", nthreads = 36, tumourLogR_file = "Tumor_LogR.txt", tumourBAF_file = "Tumor_BAF.txt", normalLogR_file = "Germline_LogR.txt", normalBAF_file = "Germline_BAF.txt") Reading locis Reading locis . . Done reading locis Multi pos start: Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 4 elements

I am not able to understand that in which file I should have 4 elements which is missing? or it is something else. Please let me know hoe to solve this. Thank you.

Regards, Tanay

sunnaa0423 commented 6 months ago

Hi there

I got stuck in ascat.prepareHTS too

library(ASCAT)
ascat.prepareHTS(
  tumourseqfile = "/data/home/xxx/share/xxx/xxx/input/tumor_hg38_sort_rmdup.bam",
  normalseqfile = "/data/home/xxx/share/xxx/xxx/input/normal_hg38_sort_rmdup.bam",
  tumourname = "tumour",
  normalname = "normal",
  allelecounter_exe = "/data/home/xxx/xxx/xxx/biosoft/miniconda3/envs/Mut/bin/alleleCounter",
  alleles.prefix = "/data/home/xxx/share/xxx/ASCAT_pipline_test/ASCAT_xxx_hg38/G1000_alleles_WGS_hg38/G1000_alleles_hg38_chr",
  loci.prefix = "/data/home/xxxx/share/xxxx/ASCAT_pipline_test/ASCAT_xxx_hg38/G1000_loci_WGS_hg38/G1000_loci_hg38_chr",
  gender = "XY",
  genomeVersion = "hg38",
  nthreads = 8,
  tumourLogR_file = "Tumor_LogR.txt",
  tumourBAF_file = "Tumor_BAF.txt",
  normalLogR_file = "Germline_LogR.txt",
  normalBAF_file = "Germline_BAF.txt",
  ref.fasta =
"/share/home/xxx/share/tools/Homo_sapiens_UCSC_hg38/Homo_sapiens/UCSC/hg38/Sequence/WholeGenomeFasta/genome.fa")

and it reports that "Error in { : task 1 failed - "EXIT_CODE == 0 is not TRUE" Calls: ascat.prepareHTS -> %dopar% -> Execution halted" in the end. The complete log file is in the attachment. ascat.log

Thanks

tlesluyes commented 6 months ago

Hi @tanayb001,

It looks like ASCAT crashes right during or after running alleleCounter, could you inspect the files called *_alleleFrequencies_chr*? How many files do you have, do they look fine and do they have as many lines as the loci files?

4 elements sound weird though. The loci files should have 2 columns (chr/pos) whereas the alleleFrequencies files should have 7 columns (chr/pos/A/C/G/T/tot). Only the allele files should have 4 columns (chr/pos/ref/alt), could you have a look at those (they should have as many lines as the loci files +1 because of the header)?

Cheers,

Tom.

tlesluyes commented 6 months ago

Hi @sunnaa0423,

This looks like a different issue than the original one posted here. The log mentions:

[E::sam_itr_next] Null iterator
[ERROR] (src/bam_access.c: bam_access_get_multi_position_base_counts:379 errno: No such file or directory) Error detected (-2) when trying to iterate through region.
[ERROR] (./src/alleleCounter.c: main:432 errno: None) Error scanning through bam file for loci list with dense snps.

This is related to an issue when running alleleCounter. Is the bam file indexed? Is the bam located at the expected place? Are chromosome names correct?

Please see the following link: https://github.com/VanLoo-lab/ascat/issues/168

Cheers,

Tom.

tanayb001 commented 6 months ago

Hi @tanayb001,

It looks like ASCAT crashes right during or after running alleleCounter, could you inspect the files called *_alleleFrequencies_chr*? How many files do you have, do they look fine and do they have as many lines as the loci files?

4 elements sound weird though. The loci files should have 2 columns (chr/pos) whereas the alleleFrequencies files should have 7 columns (chr/pos/A/C/G/T/tot). Only the allele files should have 4 columns (chr/pos/ref/alt), could you have a look at those (they should have as many lines as the loci files +1 because of the header)?

Cheers,

Tom.

Hi @tlesluyes The normal _alleleFrequencies_chr files are looking fine but the tumor _alleleFrequencies_chr files do not have any data it contains only header of 7 columns (chr/pos/A/C/G/T/Good_depth). The reference allele files have three columns as "position", "a0" and "a1". I'm not getting where does it require 4 element? Please let me know how to solve this.

Regards, Tanay

tlesluyes commented 6 months ago

Hi @tanayb001,

My bad, the allele files should contain 3 columns (not 4) so they look fine.

It's weird because ASCAT runs alleleCounter on the tumour first and then processes the normal. Unexpectedly, you do have results for the normals but empty files for the tumour. Were the tumour and normal processed the same way with the same reference and chromosome names (e.g. "chr1" is not the same as "1")?

It looks like alleleCounter did run without any issues so the first part in ascat.prepareHTS went (somehow) fine. The next step is the (internal) ascat.getBAFsAndLogRs function where the logR and BAF files are supposed to be generated. But the very first step is reading the tumour alleleFrequencies files and it must crash because at least one of those is empty.

At this point, I suggest you remove all temporary files that you have and manually run functions within ascat.prepareHTS (https://github.com/VanLoo-lab/ascat/blob/master/ASCAT/R/ascat.prepareHTS.R#L244) so I can understand where the issue occurs, what is going wrong and why this is hapenning:

suppressPackageStartupMessages(library(ASCAT))
suppressPackageStartupMessages(library(foreach))
suppressPackageStartupMessages(library(doParallel))
suppressPackageStartupMessages(library(parallel))

tumourseqfile="Tumour.bam"
normalseqfile="Normal.bam"
tumourname="Tumour_name"
normalname="Normal_name"
allelecounter_exe="/usr/bin/alleleCounter"
alleles.prefix="G1000_alleles_hg38_chr"
loci.prefix="G1000_loci_hg38_chr"
gender="XY"
genomeVersion="hg38"
nthreads=24 # ASCAT runs chromosomes in parallel so 24 is the maximum it needs
tumourLogR_file="Tumor_LogR.txt"
tumourBAF_file="Tumor_BAF.txt"
normalLogR_file="Germline_LogR.txt"
normalBAF_file="Germline_BAF.txt"
minCounts=10
BED_file=NA
probloci_file=NA
chrom_names=c(1:22,'X')
min_base_qual=20
min_map_qual=35
ref.fasta=NA
skip_allele_counting_tumour=F
skip_allele_counting_normal=F
seed=as.integer(Sys.time())

doParallel::registerDoParallel(cores=nthreads)

if (is.na(tumourLogR_file)) tumourLogR_file=paste0(tumourname,"_tumourLogR.txt")
if (is.na(tumourBAF_file)) tumourBAF_file=paste0(tumourname,"_tumourBAF.txt")
if (is.na(normalLogR_file)) normalLogR_file=paste0(tumourname,"_normalLogR.txt")
if (is.na(normalBAF_file)) normalBAF_file=paste0(tumourname,"_normalBAF.txt")

if (!skip_allele_counting_tumour) {
  # Obtain allele counts at specific loci for tumour
  foreach::foreach(CHR=chrom_names) %dopar% {
    ascat.getAlleleCounts(seq.file=tumourseqfile,
                          output.file=paste0(tumourname,"_alleleFrequencies_chr", CHR, ".txt"),
                          loci.file=paste0(loci.prefix, CHR, ".txt"),
                          min.base.qual=min_base_qual,
                          min.map.qual=min_map_qual,
                          allelecounter.exe=allelecounter_exe,
                          ref.fasta=ref.fasta)
  }
}
############################################################################################
# At this point, do all of the tumour alleleFrequencies files exist and do they look fine? #
############################################################################################

if (!skip_allele_counting_normal) {
  # Obtain allele counts at specific loci for normal
  foreach::foreach(CHR=chrom_names) %dopar% {
    ascat.getAlleleCounts(seq.file=normalseqfile,
                          output.file=paste0(normalname,"_alleleFrequencies_chr", CHR, ".txt"),
                          loci.file=paste0(loci.prefix, CHR, ".txt"),
                          min.base.qual=min_base_qual,
                          min.map.qual=min_map_qual,
                          allelecounter.exe=allelecounter_exe,
                          ref.fasta=ref.fasta)
  }
}
############################################################################################
# At this point, do all of the normal alleleFrequencies files exist and do they look fine? #
############################################################################################

# Obtain BAF and LogR from the raw allele counts
ascat.getBAFsAndLogRs(samplename=tumourname,
                      tumourAlleleCountsFile.prefix=paste0(tumourname,"_alleleFrequencies_chr"),
                      normalAlleleCountsFile.prefix=paste0(normalname,"_alleleFrequencies_chr"),
                      tumourLogR_file=tumourLogR_file,
                      tumourBAF_file=tumourBAF_file,
                      normalLogR_file=normalLogR_file,
                      normalBAF_file=normalBAF_file,
                      alleles.prefix=alleles.prefix,
                      gender=gender,
                      genomeVersion=genomeVersion,
                      chrom_names=chrom_names,
                      minCounts=minCounts,
                      BED_file=BED_file,
                      probloci_file=probloci_file,
                      seed=seed)
###################################
# Is this where the error occurs? #
###################################

# Synchronise all information
ascat.synchroniseFiles(samplename=tumourname,
                       tumourLogR_file=tumourLogR_file,
                       tumourBAF_file=tumourBAF_file,
                       normalLogR_file=normalLogR_file,
                       normalBAF_file=normalBAF_file)

Cheers,

Tom.

tanayb001 commented 6 months ago

Hi @tlesluyes

The Tumour and Normal BAM files are generated by following GATK pipeline using the same genome fasta file.

I re-downloaded the reference files and ran the same command as I mentioned in the first post. This time also it generated the same error "line 1 did not have 4 elements". However, this time all the allelefrequency file have data both the tumour and normal. Here is the command:

ascat.prepareHTS('P01-TD.fixmate.sorted.duprm.recal.bam', 'P01-BD.fixmate.sorted.duprm.recal.bam', tumourname='P01-TD', normalname='P01-BD', allelecounter_exe='/usr/bin/alleleCounter', loci.prefix='/referenceFiles/G1000_lociAll_hg38/G1000_loci_hg38_chr', alleles.prefix='/referenceFiles/G1000_allelesAll_hg38/G1000_alleles_hg38_chr', gender='XY', chrom_names = c(1:22, 'X', 'Y'), genomeVersion='hg38', BED_file ='S33613271_Covered.bed', ref.fasta = '/home/hg38.fa', nthreads = 24, tumourLogR_file = "P01-TD_LogR.txt", tumourBAF_file = "P01-TD_BAF.txt", normalLogR_file = "P01-BD_LogR.txt", normalBAF_file = "P01-BD_BAF.txt") Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 4 elements

As per your suggestion I also manually run functions within ascat.prepareHTS using you codes. While running it generated the freq file like this:

doParallel::registerDoParallel(cores=nthreads) if (is.na(tumourLogR_file)) tumourLogR_file=paste0(tumourname,"_tumourLogR.txt") if (is.na(tumourBAF_file)) tumourBAF_file=paste0(tumourname,"_tumourBAF.txt") if (is.na(normalLogR_file)) normalLogR_file=paste0(tumourname,"_normalLogR.txt") if (is.na(normalBAF_file)) normalBAF_file=paste0(tumourname,"_normalBAF.txt") if (!skip_allele_counting_tumour) { foreach::foreach(CHR=chrom_names) %dopar% { ascat.getAlleleCounts(seq.file=tumourseqfile, output.file=paste0(tumourname,"_alleleFrequencies_chr", CHR, ".txt"), loci.file=paste0(loci.prefix, CHR, ".txt"), min.base.qual=min_base_qual, min.map.qual=min_map_qual, allelecounter.exe=allelecounter_exe, ref.fasta=ref.fasta) } } Done reading locis Multi pos start: [[1]] NULL Till [[23]]

However, it generated the same error while running ascat.getBAFsAndLogRs command like this:

ascat.getBAFsAndLogRs(samplename=tumourname, tumourAlleleCountsFile.prefix=paste0(tumourname,"_alleleFrequencies_chr"), normalAlleleCountsFile.prefix=paste0(normalname,"_alleleFrequencies_chr"),
tumourLogR_file=tumourLogR_file, tumourBAF_file=tumourBAF_file, normalLogR_file=normalLogR_file, normalBAF_file=normalBAF_file, alleles.prefix=alleles.prefix, gender=gender, genomeVersion=genomeVersion, chrom_names=chrom_names, minCounts=minCounts, BED_file=BED_file, probloci_file=probloci_file, seed=seed) Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : line 1 did not have 4 elements

I do not see any allele frequency file for chrY in both tumour and normal sample.

I don't know what is the problem. Please let me know how to solve this. Thank you.

Regards, Tanay

tlesluyes commented 6 months ago

Hi @tanayb001,

Why is a BED file now used but wasn't present in your original post? Was it used previously as well? Can you show the first 5 lines? Is the function running with BED_file=NA?

Cheers,

Tom.

tanayb001 commented 6 months ago

Hi @tlesluyes

BED file was used previously as well. But I tried using without it also, but it was asking for it. My bad, I mistakenly forgot to mention that in my post.

I checked now and without giving the BED file it is running and generating the BAF and logR file fro both tumour and normal. However, since it is recommended to use BED file for WES data, is it fine if I don't use the BED file? My bed file looks like this:

browser position chr1:14446-14814 track name="Covered" description="Agilent SureSelect DNA - SureSelectXT HS Human All Exon V8 - Genomic regions covered by probes" color=0,0,128 db=hg38 chr1 14445 14814 ref|WASH7P,ref|NR_024540,ens|ENST00000488147 chr1 14928 15048 ref|WASH7P,ref|NR_024540,ens|ENST00000488147

Now I see, that this file has 4 columns. Should I remove the Gene names column and use it?

If this is fine then what should be the next step? What parameters do you recommend if my tumour sample is sequenced at 200X depth and the Normal sample is sequenced at 100X depth?

Thank you.

Regards, Tanay

tlesluyes commented 6 months ago

Hi @tanayb001,

Makes sense now: the reason it's crashing is because the first two lines of your BED file correspond to a kind of header. Such a format cannot be read by R's read.table() because of differences in the number of columns. Please remove such lines and try again.

Keeping extra columns (e.g. gene names, strand, score, etc.) is perfectly fine as long as the 3 first columns correspond to chr, start and end.

EDIT: Of note, your coverage seems high enough so you can use the default parameters.

Cheers,

Tom.

tanayb001 commented 6 months ago

Hi @tlesluyes

I have remobed the 4th column from my BED file and now ascat.prepareHTS running without any error. I have run the following command after that:

> ascat.bc = ascat.loadData(Tumor_LogR_file = "P04-TD_LogR.txt", Tumor_BAF_file = "P04-TD_BAF.txt", 
+ Germline_LogR_file = "P04-BD_LogR.txt", Germline_BAF_file = "P04-BD_BAF.txt", gender = 'XY', 
+ genomeVersion = "hg38")

[1] Reading Tumor LogR data... [1] Reading Tumor BAF data... [1] Reading Germline LogR data... [1] Reading Germline BAF data... [1] Registering SNP locations... [1] Splitting genome in distinct chunks... > ascat.plotRawData(ascat.bc, img.prefix = "Before_correction_") [1] Plotting tumor data [1] Plotting germline data

> ascat.bc = ascat.correctLogR(ascat.bc, GCcontentfile = "/referenceFiles/GC_G1000_WES_hg38/GC_G1000_hg38.txt", 
+ replictimingfile = "/referenceFiles/RT_G1000_WES_hg38/RT_G1000_hg38.txt")

[1] Sample P04-TD (1/1) GC correlation: 25bp 0.30 ; 50bp 0.34 ; 100bp 0.36 ; 200bp 0.39 ; 500bp 0.42 ; 1kb 0.44 ; 2kb 0.48 ; 5kb 0.55 ; 10kb 0.58 ; 20kb 0.59 ; 50kb 0.61 ; 100kb 0.61 ; 200kb 0.60 ; 500kb 0.57 ; 1Mb 0.53 ; Short window size: 1kb Long window size: 100kb Replication timing correlation: Bg02es 0.24 ; Bj 0.24 ; Gm06990 0.29 ; Gm12801 0.30 ; Gm12812 0.27 ; Gm12813 0.28 ; Gm12878 0.29 ; Helas3 0.26 ; Hepg2 0.32 ; Huvec 0.28 ; Imr90 0.27 ; K562 0.33 ; Mcf7 0.38 ; Nhek 0.33 ; Sknsh 0.40 ; Replication dataset: Sknsh > ascat.plotRawData(ascat.bc, img.prefix = "After_correction_") [1] Plotting tumor data [1] Plotting germline data > ascat.bc = ascat.aspcf(ascat.bc) [1] Sample P04-TD (1/1) > ascat.plotSegmentedData(ascat.bc) > ascat.output = ascat.runAscat(ascat.bc, gamma=1, write_segments = T) [1] Sample P04-TD (1/1) > QC = ascat.metrics(ascat.bc,ascat.output) > save(ascat.bc, ascat.output, QC, file = 'ASCAT_objects.Rdata')

This is generating a very noisy plot after correction also which is like this: p4td_after_correction

Please let me know id this is fine or do I need to change any parameters. Also, if I want to use the segment file for further cohort-wise copy number analysis (like GISTIC use) which segment file do I need to use and what will be my segment mean value (here I do not see any of such values).

Thank you.

Regards, Tanay

tlesluyes commented 6 months ago

Hi @tanayb001,

It depends on whether we're talking about the pre-correction or post-correction plot but still, it's not extremely noisy to me especially if it's a true biological signal with matched changes in the BAF track. If you feel like the profile is oversegmented, you may increase the penalty value to 100 or 140 when running ascat.aspcf.

The segmented data can be seen in the ASPCF plot and ascat.bc$Tumor_LogR_segmented and ascat.bc$Tumor_BAF_segmented.

Closing the issue now since it has been solved.

Cheers,

Tom.