AmpliconSuite / AmpliconSuite-pipeline

A quickstart tool for AmpliconArchitect. Performs all preliminary steps (alignment, CNV calling, seed interval detection) required prior to running AmpliconArchitect. Previously called PrepareAA.
Other
48 stars 25 forks source link

ValueError: cannot convert float NaN to integer #34

Closed WeijiaSu closed 1 year ago

WeijiaSu commented 1 year ago

Hello, I am running AmpliconSuite-pipeline and I got results successfully from most of the input bam files. However, in this particular bam input, I got this error:

########## "Wrote 005-32_cnvkit_output/hg38.fa_005-32.cns with 1064 regions /anaconda3/lib/python3.7/site-packages/numpy/lib/function_base.py:380: RuntimeWarning: Mean of empty slice. avg = a.mean(axis) /anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py:262: RuntimeWarning: Degrees of freedom <= 0 for slice keepdims=keepdims, where=where) Traceback (most recent call last): File "/software/AmpliconArchitect/src/amplified_intervals.py", line 126, in bamFileb2b = b2b.bam_to_breakpoint(bamFile, coverage_stats=cstats) File "/software/AmpliconArchitect/src/bam_to_breakpoint.py", line 107, in init self.median_coverage(window_list=coverage_windows) File "/software/AmpliconArchitect/src/bam_to_breakpoint.py", line 375, in median_coverage self.pair_support = max(math.ceil((wc_300_avg / 10.0) ((self.insert_size - self.read_length) / 2 / self.read_length)self.percent_proper), self.pair_support_min) ValueError: cannot convert float NaN to integer amplified_intervals.py returned a non-zero exit code. Exiting..." ##########

I am not sure where the NaN value comes from. The *_CNV_CALLS_pre_filtered.bed seems fine, with 7 entries.

Thanks Weijia

jluebeck commented 1 year ago

Hi Weijia,

Thanks for reaching out about this. Can I ask which version of AmpliconArchitect you have installed? I believe the issue is related to the contents of the file coverage.stats, which is located in the $AA_DATA_REPO directory. Would you kindly be able to do the following

cd $AA_DATA_REPO rm coverage.stats && touch coverage.stats chmod a+rw coverage.stats

Alternatively, just open the coverage.stats file, and remove its contents, then try again. It seems that for some reason (possibly a previous error when running AA), some NaN values were added.

If the issue persists after clearing the file, let us debug further.

Thanks, Jens

WeijiaSu commented 1 year ago

Hi Jens, Thanks for your reply. I am using AA version 1.3.r2. As you recommended, I did "cd $AA_DATA_REPO rm coverage.stats && touch coverage.stats chmod a+rw coverage.stats"

However, I still got the error.

"Wrote /gpfs/fs1/data/zhanglab/Weijia_Su/eccDNA/AA/005-32_cnvkit_output/hg38.fa_005-32.cns with 1064 regions /anaconda3/lib/python3.7/site-packages/numpy/lib/function_base.py:380: RuntimeWarning: Mean of empty slice. avg = a.mean(axis) /anaconda3/lib/python3.7/site-packages/numpy/core/_methods.py:262: RuntimeWarning: Degrees of freedom <= 0 for slice keepdims=keepdims, where=where) Traceback (most recent call last): File "/software/AmpliconArchitect/src/amplified_intervals.py", line 126, in bamFileb2b = b2b.bam_to_breakpoint(bamFile, coverage_stats=cstats) File "/software/AmpliconArchitect/src/bam_to_breakpoint.py", line 107, in init self.median_coverage(window_list=coverage_windows) File "/software/AmpliconArchitect/src/bam_to_breakpoint.py", line 375, in median_coverage self.pair_support = max(int(round((wc_300_avg / 10.0) ((self.insert_size - self.read_length) / 2 / self.read_length)self.percent_proper)), self.pair_support_min) ValueError: cannot convert float NaN to integer amplified_intervals.py returned a non-zero exit code. Exiting..."

It seems that the error starts from "Mean of empty slice". This is weird to me because I used exactly the same code for other batches of data, all completed successfully, but this error occurs for all the datasets in this particular batch.

Any suggestions will be appreciated. Thanks Weijia

WeijiaSu commented 1 year ago

BTW, I checked the file data_repo/coverage.stats, it is empty when this error occurs this time. Thanks

jluebeck commented 1 year ago

Hi Weijia,

That's very strange. Would you kindly be able to send me the header of your bam file? jluebeck [a t] ucsd.edu.

Thanks! Jens

WeijiaSu commented 1 year ago

Thank you Jens, I attached the bam header here. I am also trying to run it by using the fastq files instead of the bam file. I am not sure if this would help.

hg38.fa_005-32.bam.headers.txt

WeijiaSu commented 1 year ago

Hi Jens, Thanks for your help. I tried it by using the fastq files instead of the bam file. It did give me any error, but the AA results show that there's no amplicon. The AA_CNV_SEEDS.bed is empty. The rmdup_CNV_CALLS_pre_filtered.bed had 25 entries.

I wonder if it means that there's no amplicon and whether this gives the errors previously. It is suspicious to me that all of the samples from this batch have 0 amplicon.

Thanks Weijia

jluebeck commented 1 year ago

Hi Weijia,

Thanks, glad to hear the realignment worked. I am curious if you are able to share the stdout/stderr printed by the pipeline? What does the finish flag.txt file say? If you are able to provide complete log files and the command you ran that would help me a lot.

I would be interested in trying to reproduce the initial error you encountered. Can you point me to the location where the version of the hg38 genome fasta from your original bam file can be downloaded?

It is normal for the pre-filtered file to contain multiple entries and subsequently not have any in the seeds file. These are typically low-complexity regions that are subsequently filtered.

If this batch of files contains normal samples or cancer types that have a low rate of focal amplification, then it is not at all unusual to have 0 amplicons. I am happy to help review any outputs to ensure that things are working normally.

Thanks, Jens

WeijiaSu commented 1 year ago

Hi Jens, Thanks so much for your help. I attached the log files. AA_005-32.zip

The hg38 reference genome can be downloaded from the UCSC genome broswer

Thank you! Weijia

jluebeck commented 1 year ago

Hi Weijia,

Thank you for providing these outputs. Everything appears completely normal with the outputs. I will try to reproduce the initial error you had using the UCSC build of hg38 and some local fastq files.

I did see one file in the .zip you sent, "005-32", which logged the following call INFO:root:Commandline: /data/zhanglab/Weijia_Su/software/AmpliconArchitect/src/AmpliconArchitect.py --out 005-32 --bam /data/zhanglab/Weijia_Su/eccDNA/Mustafa_021423/hg38.fa_005-32.bam --ref GRCh38 --bed /data/zhanglab/Weijia_Su/eccDNA/AA/005-32_cnvkit_output/hg38.fa_005-32_CNV_CALLS.bed

In this case AA was run without any selection for seed regions (the raw CNV calls were used). This unfortunately will not work, and will either stall out or crash. AA is designed to be used on pre-selected regions of the genome where focal amplifications are believed to exist (or be in the vicinity) for that sample.

Thanks, Jens

jluebeck commented 1 year ago

Hi Weijia,

To follow up, I was able to reproduce this locally by feeding the hg38 UCSC genome to bwa mem and my own fastqs. Because of the presence of alt contigs in the UCSC hg38 build, bwa mem will not mark any read pairs as "proper". This will cause issues for any bioinformatic pipelines which rely on counting properly-paired reads. There is a good writeup here: https://gatk.broadinstitute.org/hc/en-us/articles/360037498992--How-to-Map-reads-to-a-reference-with-alternate-contigs-like-GRCH38

The solution is to map to a reference without alt-contigs (like the NCBI reference the AA data repo uses), or to provide the alt contig file to bwa on mapping, as mentioned in the guide above.

I will add a feature to detect this phenomenon when launching AA so that there is less mystery about the error for future users.

Thanks, Jens

WeijiaSu commented 1 year ago

Hi Jens, Thank you very much! This is very helpful.

jluebeck commented 1 year ago

Error message for this behavior added in 3bfd422.