anands-repo / hello

DNN-based small variant caller

Failed to open FASTA file /scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta #10

Open · robertzeibich opened this issue 1 year ago

robertzeibich commented 1 year ago

I am getting the following error:

2022-12-13 15:53:06,677 INFO:Finished hotspot generation for all chromosomes. Launching all caller commands.
Caller progress:  42%|████▏     | 4986/11962 [19:46:04<20:30:57, 10.59s/it][E::fai_load3_core] Failed to open FASTA file /scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta
2022-12-14 11:39:11,762 ERROR:Failure when executing Caller for args Namespace(activity='/scratch/xm41/ct/bamsDown/30x/HELLO/step2_vcfs/GALF011/hotspots_chr10_bams___GALF011__bam/shard225.txt', bam='/scratch/xm41/ct/bams/GALF011.bam', chrPrefixes=None, clr=False, debug=False, featureLength=150, highconf=None, hotspotMode='BOTH', hybrid_eval=False, hybrid_hotspot=False, include_hp=False, intersectRegions=False, keep_hdf5=False, log='/scratch/xm41/ct/bamsDown/30x/HELLO/step2_vcfs/GALF011/features_bams___GALF011__bam/features4989.log', mapq_threshold=5, network='/opt/hello/models/illumina_multi_coverage_mapq_threshold_hg002_continue_run16.wrapper.dnn', noAlleleLevelFilter=False, only_contained=False, outputPrefix='/scratch/xm41/ct/bamsDown/30x/HELLO/step2_vcfs/GALF011/features_bams___GALF011__bam/features4989', pacbio=None, provideFeatures=True, q_threshold=10, reconcilement_size=10, ref='/scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta', reuseSearchers=False, simple=True, test_labeling=False, truth=None)
Caller progress:  42%|████▏     | 4986/11962 [19:46:05<27:39:28, 14.27s/it]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/opt/hello/python/call.py", line 71, in launcher
    raise e
  File "/opt/hello/python/call.py", line 68, in launcher
    return functor(args_)
  File "/opt/hello/python/caller_calling.py", line 872, in main
    for i, siteDict in enumerate(datagen):
  File "/opt/hello/python/trainDataTools.py", line 1029, in data
    ref = ReferenceCache(database=reference);
  File "/opt/hello/python/PySamFastaWrapper.py", line 9, in __init__
    self.handle = pysam.FastaFile(database);
  File "pysam/libcfaidx.pyx", line 123, in pysam.libcfaidx.FastaFile.__cinit__
  File "pysam/libcfaidx.pyx", line 183, in pysam.libcfaidx.FastaFile._open
OSError: error when opening file `/scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta`
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/hello/python/call.py", line 344, in <module>
    main(args)
  File "/opt/hello/python/call.py", line 220, in main
    total=len(caller_args),
  File "/opt/miniconda3/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/opt/miniconda3/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
OSError: error when opening file `/scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta`

Do you know why?

Version: hello="/usr/local/hello/d497e3e/"

robertzeibich commented 1 year ago

Somehow, I am getting the FASTA error and I do not know why. Should I use the old version, 709b7c3? Currently, I do not know what to do or how to fix it.

robertzeibich commented 1 year ago

I also have the feeling that, despite multiprocessing being enabled, the new version is much slower than the old version.

robertzeibich commented 1 year ago

Can you tell me if the identified variants differ between the old version (hello/709b7c3) and the new version (hello/d497e3e)?

anands-repo commented 1 year ago

The way FASTA files are read is the same in both versions: via pysam, a Python interface to samtools. The error you posted indicates a problem with pysam's ability to read the FASTA file. One workaround I can think of is to delete all the FASTA indices that are already there and reindex the FASTA file using samtools.
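
A minimal sketch of that workaround in Python, assuming the reference path from the log above; the index filenames are examples, and pysam.faidx is just a wrapper around samtools faidx, so running samtools faidx on the FASTA directly achieves the same thing:

import os

import pysam  # Python wrapper around samtools/htslib, the same library hello uses

ref = "/scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta"

# Remove any stale index files sitting next to the reference
# (these suffixes are examples; delete whatever index files are actually present).
for suffix in (".fai", ".index"):
    idx = ref + suffix
    if os.path.exists(idx):
        os.remove(idx)

# Rebuild the .fai index; equivalent to running: samtools faidx <reference>
pysam.faidx(ref)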

When refactoring the code to the new version, I did not change the functionality.

As you note, GNU parallel was used previously, and multiprocessing is used now. My thinking is that they should be equivalent in terms of speed.

If the two versions are run on the same input files, I do not expect any material difference in the output, as the algorithms are the same (by "material difference" I mean that a plain diff between the two VCFs may show differences, whereas a hap.py evaluation would not). If there is a material difference, it probably means I changed something unintentionally between the two versions. But in some tests I ran, I did not notice any such difference.

I prefer the newer version, personally, because it does better error handling and reporting.

anands-repo commented 1 year ago

The version before the major refactoring, by the way, would be the branch devel_bugfix. However, it doesn't have the improvements in the UI that are included in the new version, and may pose difficulties in error handling and debugging issues.

There were bugs in the previous commits on the repo branch, and using those may produce bad results. I think the older commit you mention is on the repo branch, so I advise against using it.

In my previous post, when I referred to the new and old versions, I was comparing the repo branch (latest commit) and the devel_bugfix branch, respectively.

robertzeibich commented 1 year ago

Thanks for the quick reply.

I noticed a fasta.index and a fasta.fai file in the folder of the fasta file. I moved the fasta.index into another folder and reindexed the fasta, but I do not see a difference between the previous fasta.fai and the current fasta.fai. Why does the tool inform me at a Caller progress of 42% that there seems to be an issue with the fasta.fai? Isn't that a bit late?

The time difference seems huge. When I run commit 709b7c3 with 4 cpus-per-task, I have results after a few hours. When I run the latest commit d497e3e with 4 cpus-per-task, I only have results after days (at least 1+ day).

I also contacted the team who set up the latest version. If there is something strange going on with the setup, they should notice.

anands-repo commented 1 year ago

The error output points to a log file: /scratch/xm41/ct/bamsDown/30x/HELLO/step2_vcfs/GALF011/features_bams___GALF011__bam/features4989.log

Could you tell me what's in it? The exact log filename may be different if you ran the pipeline again, so I would look at the latest failure message to find the name of the correct log file.

Basically, chromosome 10 (and every other chromosome) is split into pieces, and the caller is run on each piece. It looks like 42% of the pieces read the reference without any issue, but then one piece failed to access it. I am not sure why pysam develops an issue like this.

It is also surprising that the older version seems to have run to completion. The older version should also report the error, but at the tail end after everything has run, rather than after 42% of the pieces have run.
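
For intuition, the structure visible in the traceback is roughly the following (an illustrative sketch, not the actual hello source): each shard is handed to a pool worker, and every worker opens the reference on its own, so a single shard can fail inside pysam.FastaFile long after thousands of earlier shards have succeeded.

import multiprocessing

import pysam
import tqdm

REF = "/scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta"

def call_shard(shard_path):
    # Each worker re-opens the reference independently; a failure here only
    # surfaces for whichever shard happens to be running at that moment.
    ref = pysam.FastaFile(REF)
    try:
        # ... generate features and run the model for the hotspots listed in shard_path ...
        return shard_path
    finally:
        ref.close()

if __name__ == "__main__":
    shards = ["shard%d.txt" % i for i in range(11962)]  # hypothetical shard list
    with multiprocessing.Pool(processes=4) as pool:
        for _ in tqdm.tqdm(pool.imap(call_shard, shards), total=len(shards)):
            pass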

anands-repo commented 1 year ago

Here is the same error: https://stackoverflow.com/questions/51059608/python-error-oserror-error-when-opening-file-all-fprau-fasta-fasta-file-too

It seems pysam raised an error in that case for a large FASTA file but not for a small one. This could be a memory issue if it happens only for large FASTA files, although I have never faced this issue when opening the FASTA file myself. How much memory does the machine have?

To be on the safe side, I would remove the .fai file as well before reindexing the reference. There are also other index files (such as the faidx index, etc.), but I do not remember them exactly.

You can also try upgrading pysam: pip install --upgrade pysam.
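
A quick way to confirm which pysam version an environment or container actually picks up (a trivial check, nothing hello-specific):

import pysam

print(pysam.__version__)  # e.g. 0.15.3; compare before and after the upgrade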

robertzeibich commented 1 year ago

I also thought that this could be a memory issue, so I have increased the memory allocation.

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=200G

These are my current settings. If hello terminates at a different percentage this time, I would take that as a sign that this is indeed a memory issue.

If I have to rerun, I will remove the file and re-index. I do not want to sabotage the current run.

Do you know which version you use in your Docker container? I am asking because that is what we currently have deployed.

Once I run into the same issue again, I will look at the log and share the file with you; I have already deleted the old one. Currently, I am at 23%, so tomorrow we might already know more.

robertzeibich commented 1 year ago

Stopped at 33% this time:

Caller progress:  33%|███▎      | 3988/11955 [20:06:32<35:35:50, 16.08s/it]
Caller progress:  33%|███▎      | 3996/11955 [20:08:29<27:11:35, 12.30s/it][E::fai_load3_core] Failed to open FASTA file /scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta
2022-12-19 02:40:59,817 ERROR:Failure when executing Caller for args Namespace(activity='/scratch/xm41/ct/bamsDown/30x/HELLO/step2_vcfsNEW/GALF003/hotspots_chr8_bams___GALF003__bam/shard275.txt', bam='/scratch/xm41/ct/bams/GALF003.bam', chrPrefixes=None, clr=False, debug=False, featureLength=150, highconf=None, hotspotMode='BOTH', hybrid_eval=False, hybrid_hotspot=False, include_hp=False, intersectRegions=False, keep_hdf5=False, log='/scratch/xm41/ct/bamsDown/30x/HELLO/step2_vcfsNEW/GALF003/features_bams___GALF003__bam/features3998.log', mapq_threshold=5, network='/opt/hello/models/illumina_multi_coverage_mapq_threshold_hg002_continue_run16.wrapper.dnn', noAlleleLevelFilter=False, only_contained=False, outputPrefix='/scratch/xm41/ct/bamsDown/30x/HELLO/step2_vcfsNEW/GALF003/features_bams___GALF003__bam/features3998', pacbio=None, provideFeatures=True, q_threshold=10, reconcilement_size=10, ref='/scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta', reuseSearchers=False, simple=True, test_labeling=False, truth=None)
Caller progress:  33%|███▎      | 3996/11955 [20:08:32<40:07:05, 18.15s/it]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/opt/miniconda3/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/opt/hello/python/call.py", line 71, in launcher
    raise e
  File "/opt/hello/python/call.py", line 68, in launcher
    return functor(args_)
  File "/opt/hello/python/caller_calling.py", line 872, in main
    for i, siteDict in enumerate(datagen):
  File "/opt/hello/python/trainDataTools.py", line 1029, in data
    ref = ReferenceCache(database=reference);
  File "/opt/hello/python/PySamFastaWrapper.py", line 9, in __init__
    self.handle = pysam.FastaFile(database);
  File "pysam/libcfaidx.pyx", line 123, in pysam.libcfaidx.FastaFile.__cinit__
  File "pysam/libcfaidx.pyx", line 183, in pysam.libcfaidx.FastaFile._open
OSError: error when opening file `/scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta`
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/hello/python/call.py", line 344, in <module>
    main(args)
  File "/opt/hello/python/call.py", line 220, in main
    total=len(caller_args),
  File "/opt/miniconda3/lib/python3.7/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/opt/miniconda3/lib/python3.7/multiprocessing/pool.py", line 748, in next
    raise value
OSError: error when opening file `/scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta`

Contents of /scratch/xm41/ct/bamsDown/30x/HELLO/step2_vcfsNEW/GALF003/features_bams___GALF003__bam/features3998.log:

2022-12-19 02:40:52,156 INFO:Launching caller with arguments Namespace(activity='/scratch/xm41/ct/bamsDown/30x/HELLO/step2_vcfsNEW/GALF003/hotspots_chr8_bams___GALF003__bam/shard275.txt', bam='/scratch/xm41/ct/bams/GALF003.bam', chrPrefixes=None, clr=False, debug=False, featureLength=150, highconf=None, hotspotMode='BOTH', hybrid_eval=False, hybrid_hotspot=False, include_hp=False, intersectRegions=False, keep_hdf5=False, log='/scratch/xm41/ct/bamsDown/30x/HELLO/step2_vcfsNEW/GALF003/features_bams___GALF003__bam/features3998.log', mapq_threshold=5, network='/opt/hello/models/illumina_multi_coverage_mapq_threshold_hg002_continue_run16.wrapper.dnn', noAlleleLevelFilter=False, only_contained=False, outputPrefix='/scratch/xm41/ct/bamsDown/30x/HELLO/step2_vcfsNEW/GALF003/features_bams___GALF003__bam/features3998', pacbio=None, provideFeatures=True, q_threshold=10, reconcilement_size=10, ref='/scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta', reuseSearchers=False, simple=True, test_labeling=False, truth=None)
2022-12-19 02:40:52,187 INFO:Setting up readers
2022-12-19 02:40:52,266 INFO:Getting callable sites
2022-12-19 02:40:59,731 INFO:Obtained 1 hotspots, creating data and running models
2022-12-19 02:40:59,872 INFO:Launching caller with arguments Namespace(activity='/scratch/xm41/ct/bamsDown/30x/HELLO/step2_vcfsNEW/GALF003/hotspots_chr9_bams___GALF003__bam/shard399.txt', bam='/scratch/xm41/ct/bams/GALF003.bam', chrPrefixes=None, clr=False, debug=False, featureLength=150, highconf=None, hotspotMode='BOTH', hybrid_eval=False, hybrid_hotspot=False, include_hp=False, intersectRegions=False, keep_hdf5=False, log='/scratch/xm41/ct/bamsDown/30x/HELLO/step2_vcfsNEW/GALF003/features_bams___GALF003__bam/features4000.log', mapq_threshold=5, network='/opt/hello/models/illumina_multi_coverage_mapq_threshold_hg002_continue_run16.wrapper.dnn', noAlleleLevelFilter=False, only_contained=False, outputPrefix='/scratch/xm41/ct/bamsDown/30x/HELLO/step2_vcfsNEW/GALF003/features_bams___GALF003__bam/features4000', pacbio=None, provideFeatures=True, q_threshold=10, reconcilement_size=10, ref='/scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta', reuseSearchers=False, simple=True, test_labeling=False, truth=None)
2022-12-19 02:40:59,908 INFO:Setting up readers

Do you see anything specific in the log?

robertzeibich commented 1 year ago

I just looked at the python and pysam versions:

d497e3e (new): Python 3.7.6, pysam 0.15.3

709b7c3 (old): Python 3.7.7, pysam 0.15.3

The pysam versions are the same, but the Python versions differ. Would you change the Python version to 3.7.7?

robertzeibich commented 1 year ago

I just ran the Singularity container and opened the FASTA file with pysam.FastaFile. I did not get an error.
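
Such a manual check is presumably along these lines (a minimal sketch; the reference path is the one from this thread, and the chr1 contig name assumes the standard hg38 naming):

import pysam

ref = "/scratch/xm41/hg38_resources/resources_broad_hg38_v0_Homo_sapiens_assembly38.fasta"

fasta = pysam.FastaFile(ref)       # raises OSError if the file or its index cannot be opened
print(fasta.references[:3])        # first few contig names, e.g. chr1, chr2, chr3
print(fasta.fetch("chr1", 0, 60))  # fetch a small slice to confirm sequence access works
fasta.close()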

robertzeibich commented 1 year ago

Changed the settings again:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=10
#SBATCH --mem=400G

Let's see if I still get the same error. Hopefully, the run finishes quickly. I also executed a run with just the first two chromosomes.

anands-repo commented 1 year ago

Hi,

Thanks for bringing this to my attention! In addition, running the two different commits on your end made it easier for me to analyze the issue.

I went through the code of both commit versions. Consistent with your observations, I found reasons why the new version will be slower and may have memory-requirement issues. I will need some time to fix this, and since I cannot allocate the necessary time right now, I will have to come back to it later. However, I can recommend an older commit for your purpose.

After going through the commit history, the recommended version is the devel_bugfix branch (commit ID bf8e822666d1c254234efeab8f4f76b7d56bb7a6). I believe you have used commit ID 709b7c36d6f5560c50335c97d3961dd540196940 before (please check the full SHA to make sure)? Since the only difference between the two commits is in the README, the latter commit is valid to use. Please note that the README of that commit recommends fixing the VCF file at the end using a shell command, because the VCF from that commit does not conform to the format required by hap.py. If you use a tool other than hap.py to process the VCF, other changes may be needed to fix the VCF as well.

Thanks once again for trying different options and highlighting this issue.

Best Regards


robertzeibich commented 1 year ago

Thanks for getting back to me.

I know about the troubles with the following commit https://github.com/anands-repo/hello/commit/709b7c36d6f5560c50335c97d3961dd540196940. We have already had this conversation in the past.

https://github.com/anands-repo/hello/commit/d497e3e5a5a293c6c077e656f76bdbc79ed93882: If hello runs through with the increased memory, are the results still trustworthy, or would you recommend not using them?

Please let me know when you have fixed the issue.

anands-repo commented 1 year ago

I do not think the latest commit will complete; I expect it to fail again. What I previously said about commit 709b7c36d6f5560c50335c97d3961dd540196940 was incorrect, sorry about that (I mistook it for a different commit that had issues).

So my recommendation is to run commit 709b7c36d6f5560c50335c97d3961dd540196940, and then fix the VCF file by using the following shell command (there are some formatting errors in the VCF header for this version):

#!/bin/bash

workdir=  # Output directory of the hello run (set this before running)

# The VCF header written by this commit is missing the Type and Number fields in the
# MixtureOfExpertPrediction INFO line; rewrite that line so hap.py accepts the VCF.
fix_vcf() {
    correct_string="##INFO=<ID=MixtureOfExpertPrediction,Type=String,Number=1,Description=\"Mean predictions from experts\">"
    sed "s?##INFO=<ID=MixtureOfExpertPrediction,Description=\"Mean predictions from experts\"?$correct_string?g" "$1" > "$2"
}

fix_vcf "$workdir/results.mean.vcf" "$workdir/results.mean.corrected.vcf"

robertzeibich commented 1 year ago

Your assumption was correct. The job failed at 83%.

Thanks for providing the bash script.