EichlerLab / pav

Phased assembly variant caller
97 stars 8 forks source link

Empty FASTA issue #24

Closed YahGao closed 1 year ago

YahGao commented 1 year ago

Hi,

I'm using the primary fasta file (hifiasm.bp.p_ctg.fa) got from hifiasm to call SV. According to the instructions, I provided an empty fasta file. So the assemblies.tsv is below NAME HAP1 HAP2 sample_4450 /mypath/hifiasm.bp.p_ctg.fa.gz /mypath/empty.fa.gz

My config json is: {"reference":"/mypath/ref.fa.gz","assembly_table":"/mypath/assemblies.tsv"}

But the program failed. And the error is like "[faidx] Could not build fai index temp/sample/align/contigs_h2.fa.gz.fai". Do you have any suggestions for this issue?

Thanks

paudano commented 1 year ago

If it's a gzipped empty file, then it's not non-zero in size (it's about 29 bytes). I provide an empty FASTA file with 0 bytes. If it has a new-line or space in it, it's not zero bytes and PAV will try to read it as a FASTA file.

If your file is zero bytes (see with ls -l), then try naming it just empty.fa.

Let me know if none of this solves the problem.

YahGao commented 1 year ago

Yeah, it worked well. But after running for a while, other errors appear as follows. I'm not sure if this is a problem with the input file.

/project/bin/pav/rules/call.snakefile:592: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  df_insdel.to_csv(output.bed_insdel, sep='\t', index=False, compression='gzip')
/project/uvm_mckay/bin/pav/rules/call.snakefile:592: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  df_insdel.to_csv(output.bed_insdel, sep='\t', index=False, compression='gzip')
/project/uvm_mckay/bin/pav/rules/call.snakefile:592: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  df_insdel.to_csv(output.bed_insdel, sep='\t', index=False, compression='gzip')
/project/uvm_mckay/bin/pav/rules/call.snakefile:592: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  df_insdel.to_csv(output.bed_insdel, sep='\t', index=False, compression='gzip')
/project/uvm_mckay/bin/pav/rules/call.snakefile:592: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  df_insdel.to_csv(output.bed_insdel, sep='\t', index=False, compression='gzip')
/project/uvm_mckay/bin/pav/rules/call.snakefile:592: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  df_insdel.to_csv(output.bed_insdel, sep='\t', index=False, compression='gzip')
/project/uvm_mckay/bin/pav/rules/call.snakefile:592: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  df_insdel.to_csv(output.bed_insdel, sep='\t', index=False, compression='gzip')
/project/uvm_mckay/bin/pav/rules/call.snakefile:592: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  df_insdel.to_csv(output.bed_insdel, sep='\t', index=False, compression='gzip')
/project/uvm_mckay/bin/pav/rules/call.snakefile:592: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  df_insdel.to_csv(output.bed_insdel, sep='\t', index=False, compression='gzip')
/project/uvm_mckay/bin/pav/rules/call.snakefile:592: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
  df_insdel.to_csv(output.bed_insdel, sep='\t', index=False, compression='gzip')
[Mon Nov 21 10:03:22 2022]
Error in rule call_cigar_merge:
    jobid: 0
    output: temp/sample_4450/cigar/pre_inv/svindel_insdel_h1.bed.gz, temp/sample_4450/cigar/pre_inv/snv_snv_h1.bed.gz

RuleException:
IndexError in line 554 of /project/uvm_mckay/bin/pav/rules/call.snakefile:
list index out of range
  File "/software/7/apps/snakemake/5.25.0/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 2168, in run_wrapper
  File "/project/uvm_mckay/bin/pav/rules/call.snakefile", line 554, in __rule_call_cigar_merge
  File "/project/uvm_mckay/bin/pav/rules/call.snakefile", line 554, in <listcomp>
  File "/home/.local/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
  File "/home/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
  File "/home/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 581, in _read
  File "/home/.local/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1254, in read
  File "/home/.local/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 227, in read
  File "/home/.local/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 404, in _concatenate_chunks
  File "/home/.local/lib/python3.8/site-packages/pandas/util/_exceptions.py", line 32, in find_stack_level
  File "/software/7/apps/snakemake/5.25.0/lib/python3.8/inspect.py", line 1514, in stack
  File "/software/7/apps/snakemake/5.25.0/lib/python3.8/inspect.py", line 1491, in getouterframes
  File "/software/7/apps/snakemake/5.25.0/lib/python3.8/inspect.py", line 1465, in getframeinfo
  File "/software/7/apps/snakemake/5.25.0/lib/python3.8/inspect.py", line 840, in findsource
  File "/software/7/apps/snakemake/5.25.0/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 529, in _callback
  File "/software/7/apps/snakemake/5.25.0/lib/python3.8/concurrent/futures/thread.py", line 57, in run
  File "/software/7/apps/snakemake/5.25.0/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 515, in cached_or_run
  File "/software/7/apps/snakemake/5.25.0/lib/python3.8/site-packages/snakemake/executors/__init__.py", line 2199, in run_wrapper
Exiting because a job execution failed. Look above for error message
/project/uvm_mckay/bin/pav/dep/svpop/svpoplib/ref.py:121: FutureWarning: The squeeze argument has been deprecated and will be removed in a future version. Append .squeeze("columns") to the call to squeeze.

  return pd.read_csv(
/project/bin/pav/dep/svpop/svpoplib/ref.py:121: FutureWarning: The squeeze argument has been deprecated and will be removed in a future version. Append .squeeze("columns") to the call to squeeze.
paudano commented 1 year ago

I think different versions of pandas are being picky about different things. What reference are you using, and does it have numeric chromosome names (e.g. "1") or strings (e.g. "chr1")? I think that's what it's complaining about when writing the table (a mix of numbers and strings for the chromosome names). I can fix the rule that's having this problem, but I might need to run it to fix all the other rules that might also fail for the same reason.

YahGao commented 1 year ago

Actually, the species I'm focusing on is cattle. I'm using the reference downloaded from Ensembl, and it has numerical chromosome names (eg "1").

paudano commented 1 year ago

I am developing a test for this, and I'll fix any problems I find. I'm also planning to share docker and singularity images, which I will also testing. I'll let you know when I have something.

paudano commented 1 year ago

I have fixed any bugs I could find with numeric references. Can you test with the latest version (2.2.0)?

YahGao commented 1 year ago

Thanks for the great update. When I use the main fasta file obtained from hifiasm it runs smoothly. But using the haplotype file with hifiasm , I get some different errors. If the input format is a compressed file (i.e. fa.gz), I get the same error as pasted above. When I changed to using fa files, it worked for a while but still failed with the following error message.

rule call_merge_haplotypes_batch:
    input: data/ref/merge_batch.tsv.gz, results/sample/bed/pre_merge/h1/svindel_ins.bed.gz, results/sample/bed/pre_merge/h2/svindel_ins.bed.gz, results/sample/callable/callable_regions_h1_500.bed.gz, results/sample/callable/callable_regions_h2_500.bed.gz
    output: temp/sample/bed/bychrom/svindel_ins/6.bed.gz
    jobid: 208
    wildcards: asm_name=sample, vartype_svtype=svindel_ins, batch=6
    threads: 8
    resources: tmpdir=/local/bgfs/myfolder8358701

/project/bin/pav_2.2.0/dep/svpop/svpoplib/svmerge.py:506: DtypeWarning: Columns (0,10) have mixed types.Specify dtype option on import or set low_memory=False.
  return merge_sample_by_support(df_support, bed_list, sample_names)
[Mon Dec 12 07:41:21 2022]
Error in rule call_merge_haplotypes_batch:
    jobid: 0
    output: temp/sample/bed/bychrom/svindel_ins/6.bed.gz

Traceback (most recent call last):
  File "/project/bin/pav_2.2.0/dep/svpop/svpoplib/variant.py", line 676, in version_id
    name_version = int(tok[1]) + 1
ValueError: invalid literal for int() with base 10: '1-69028-INS-1'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/software/7/apps/snakemake/6.12.1/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 593, in _callback
    raise ex
  File "/software/7/apps/snakemake/6.12.1/lib/python3.10/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/software/7/apps/snakemake/6.12.1/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 579, in cached_or_run
    run_func(*args)
  File "/software/7/apps/snakemake/6.12.1/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 2461, in run_wrapper
    raise ex
  File "/software/7/apps/snakemake/6.12.1/lib/python3.10/site-packages/snakemake/executors/__init__.py", line 2418, in run_wrapper
    run(
  File "/project/bin/pav_2.2.0/rules/call.snakefile", line 194, in __rule_call_merge_haplotypes_batch
  File "/project/bin/pav_2.2.0/pavlib/call.py", line 262, in merge_haplotypes
    df = svpoplib.svmerge.merge_variants(
  File "/project/bin/pav_2.2.0/dep/svpop/svpoplib/svmerge.py", line 78, in merge_variants
    return merge_variants_nr(
  File "/project/bin/pav_2.2.0/dep/svpop/svpoplib/svmerge.py", line 386, in merge_variants_nr
    df_new['ID'] = svpoplib.variant.version_id(df_new['ID'], set(df['ID']))
  File "/project/bin/pav_2.2.0/dep/svpop/svpoplib/variant.py", line 678, in version_id
    raise RuntimeError(f'Error de-duplicating variant ID field: Split "{name}" on "." and expected to find an integer at the end')
RuntimeError: Error de-duplicating variant ID field: Split "NKLS02000480.1-69028-INS-1" on "." and expected to find an integer at the end
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

It seems to be an input file problem.

paudano commented 1 year ago

This was a bug triggered by having "." in the reference chromosome name, it's not something I tested previously. It is fixed in #8f28877 (v2.2.2), thank you for reporting this. Let me know if it can now run to completion.

YahGao commented 1 year ago

Thanks for the new version. It works well after removing chromosomes including "."

paudano commented 1 year ago

To be clear, the new version corrects the bug, so you can keep chromosomes with "." in them and PAV will process them correctly. I'll close this, but if you still have an issue with "." in chromosome names, please feel free to re-open.