HurlesGroupSanger / indelible

Structural Variation breakpoint discovery via adaptive learning
GNU General Public License v3.0
15 stars 1 forks source link

`ValueError` when running `indelible.py database` #4

Closed marrip closed 3 years ago

marrip commented 3 years ago

hey,

I am running indelible on some patient data and could run the steps fetch, aggregate and score without any problems. However, now I am running database which runs and logs some values until it throws this error:

Traceback (most recent call last):
  File "/indelible/indelible.py", line 144, in <module>
    indelible.build_database(args.fof, args.output_path, args.reference_path, config, args.bwa_thread)
  File "/indelible/indelible/build_database.py", line 142, in build_database
    bwa_parser = BWARunner(final_frame, output_path, fasta, bwa_loc, bwa_threads)
  File "/indelible/indelible/bwa_runner.py", line 44, in __init__
    ref_seq = self.__get_ref_string(row["dir"], row["chrom"], row["pos"])
  File "/indelible/indelible/bwa_runner.py", line 89, in __get_ref_string
    ref_seq = self.__fasta.fetch(reference=chrom, start=left, end=right)
  File "pysam/libcfaidx.pyx", line 290, in pysam.libcfaidx.FastaFile.fetch
  File "pysam/libcutils.pyx", line 252, in pysam.libcutils.parse_region
ValueError: start out of range (-48)

I am not completely sure what is causing the error but I saw that in bwa_runner.py is doing some calculations which might cause the error. Any kind of feedback would be highly appreciated πŸ˜„

eugenegardner commented 3 years ago

I think this may be because I messed up my branches by accident when doing some recent work to update that particular module. So you are basically using a half-deployed version of that code. I am planned to do a merge (very) soon, but in the meantime you can try pulling the sv_caller branch and see if it fixes the problem.

This is also directly related to issue #3 and will hopefully be corrected moving forward.

marrip commented 3 years ago

Haha, you are little too fast. The branch disappeared and now I see you have a version tag. Is v1.1.0 the one you would recommend?

eugenegardner commented 3 years ago

Yeah, realized after I posted that message the development on that branch was pretty much done other than bug fixes...

So I merged it into master and generated a new release. Try v1.1.0 for now and let me know if you run into any bugs. I think v.1.0.0 may be a little more stable (and reflects the release at the time of the medrxiv preprint) so you may choose to revert to that if necessary.

I'll leave this open to track the original issue if necessary...

marrip commented 3 years ago

Hey again,

I tried v1.1.0 running database and encountered this one:

...
bwa ran with code: 0
Traceback (most recent call last):
  File "/indelible/indelible.py", line 146, in <module>
    indelible.build_database(args.fof, args.output_path, args.reference_path, config, args.priors, args.bwa_thread)
  File "/indelible/indelible/build_database.py", line 227, in build_database
    v = add_alignment_information(v, decisions, repeat_info)
  File "/indelible/indelible/build_database.py", line 73, in add_alignment_information
    if int(other_coord[1]) < curr_bp["pos"]:
ValueError: invalid literal for int() with base 10: 'KN707671v1'

Seems like indelible tries to read some string as an integer which cannot be transformed. I did not reproduce the output from the previous steps. Should I have done that? I will try v1.0 and let you know if I get past this point ☺️

Update: With v1.0 the task completes without any errors and completes also much faster than with v1.1.0. Could you tell me why?

marrip commented 3 years ago

Running v1.0, I tried the annotate and got this error:

Traceback (most recent call last):
  File "/indelible/indelible.py", line 164, in <module>
    indelible.annotate(args.input_path, args.output_path, args.database, config)
  File "/indelible/indelible/annotate.py", line 378, in annotate
    bhash = create_blast_hash(input_path)
  File "/indelible/indelible/annotate.py", line 295, in create_blast_hash
    for row in csv.DictReader(open(blast_nonrepeats_path, 'r'), delimiter="\t"):
FileNotFoundError: [Errno 2] No such file or directory: 'indelible/duplicates_marked.bam.counts.scored.fasta.hits_nonrepeats'

Indeed this file doesn't exist - which task does produce this?

eugenegardner commented 3 years ago

I think you may be using the README for v1.1.0 to run v1.0.0. There is an additional module that has been removed ("Blast") whose functionality is now within "Database" in v1.1.0.

As for the issue in v1.1.0 – are you just running one sample? If so, could you share your output for the score command (something like *.bam.counts.scored if run similar to the README)?

marrip commented 3 years ago

Hey,

I did indeed use the README for v1.1.0 - thought it could be rather similar πŸ˜… . But now I took a look and tried to run the blast command, however, I need a windowmasker db and I had some troubles running windowmasker myself due to some dependencies. I am using the reference genome GRCh38. I will try to fix it next week and see if I can get further with the analysis.

Regarding sharing the files, I am a bit hesitant as this is actual patient data. I took a look at the .scored file though and found that the string KN707671v1 stems from the chromosome ID chrUn_KN707671v1_decoy. Should these be excluded from the reference before analysis?

eugenegardner commented 3 years ago

Yeah, I would probably try to stick with the current version so that it is easier for me to support. windowmasker is annoying to run, and I couldn't include it in the repo due to it's size...

As for sharing, it doesn't necessarily have to be public here on github. The file should be reasonably small and if you strip patient identifiers from it we can arrange a secure share using open ssl, something like:

echo "PASSWORD" | openssl enc -md MD5 -pass stdin -aes-256-cbc -salt -in sample.scored > sample.enc

Obviously don't use "PASSWORD"... But you can send to be with a private message/email (in the readme).

Up to you but would help me to debug the issue. The chromosome column should not get fed into that particular line in the code, so there is likely a more complicated formatting thing going on here that is going to be difficult for me to diagnose without the original file. I think even a slice (like 10 lines above and below) would be sufficient.

marrip commented 3 years ago

Hey, I created a slice including header and additional lines before and after the offending line. Should I send it to your mail address?

eugenegardner commented 3 years ago

Yup!

eugenegardner commented 3 years ago

I think I've figured out the issue. These "chromosomes" contain "_" characters in their names and I code some of the data with "_" as delimiters which I split on (don't know why I realized just now that this is a bad idea.....).

I have pushed a fix live, so pull the latest master branch and let me know if it fixes your issues. I will incorporate this fix into a later full release at some point.

Thanks for your patience!

marrip commented 3 years ago

Hey Eugene,

now I was able to run all the steps until annotate. Then I got this error:

Traceback (most recent call last):
  File "/indelible/indelible.py", line 155, in <module>
    indelible.annotate(args.input_path, args.output_path, args.database, config)
  File "/indelible/indelible/annotate.py", line 224, in annotate
    v = attach_db(v, db)
  File "/indelible/indelible/annotate.py", line 119, in attach_db
    db_entry = db[key]
KeyError: 'chr1:104160'

I saw that you changed the separators for chromosomes and positions so I guess it has something to do with that?

eugenegardner commented 3 years ago

I think this is a separate issue. I was able to run InDelible without issue using my own tests. I think this has to do with the 'chr' leader on the entry. I have stripped it pretty much everywhere else, but I seem to have missed it here.

I have pushed a bug fix – go ahead and test it for me. Again, sorry for the issues, but these issues seem to only show up when other people start using it...

marrip commented 3 years ago

There seems to be a problem still:

Traceback (most recent call last):
  File "/indelible/indelible.py", line 155, in <module>
    indelible.annotate(args.input_path, args.output_path, args.database, config)
  File "/indelible/indelible/annotate.py", line 224, in annotate
    v = attach_db(v, db)
  File "/indelible/indelible/annotate.py", line 119, in attach_db
    db_entry = db[key]
KeyError: '1:104160'
eugenegardner commented 3 years ago

I have run InDelible on my own Hg38 test data and cannot replicate your issue. Can you share your database and annotate command?

Are you potentially using the "database" file included with InDelible rather than the one that database creates?

Also, can you check if that the entry "1:104160" actually exists in your database file that you generated with 'database'?

marrip commented 3 years ago

I think I do πŸ˜… I run this:

indelible.py annotate --i sample.bam.counts.scored --o sample.bam.counts.scored.annotated --c config.yml --d Indelible_db_10k.hg38.bed

and Indelible_db_10k.hg38.bed is from the the included zipfile. Indeed, the position exists in my tsv database file. I think I got confused by the format difference tsv vs. bed. Is there some kind of shortcut to generate a bed from the database?

eugenegardner commented 3 years ago

I think based on your experience I need to make the README more clear and make sure file names are consistent (file suffixes don't matter here...). There are two options:

  1. Run InDelible and generate a database using ONLY genomes from your dataset to annotate with (not recommended):
indelible.py database --f fofn.txt --o db.txt ... <other options>
indelible.py annotate --i file.scored --d db.txt ... <other options>
  1. Run InDelible and generate a database using BOTH genomes from your dataset and from the priors included with InDelible (Recommended):
indelible.py database --f fofn.txt --o db.txt --priors Indelible_db_10k.hg38.bed ... <other options>
indelible.py annotate --i file.scored --d db.txt ... <other options>

Hope that clears things up? I have fixed the README to hopefully be more accurate of the above workflow.

marrip commented 3 years ago

Much better indeed πŸ˜‰ I will try if I get it right and let you know. Thanks for being patient and responding quickly.

Have a great weekend.

eugenegardner commented 3 years ago

No worries. These are mistakes I've made so thanks for catching them for me. Let me know how it goes and I will close this issue.

marrip commented 3 years ago

Hey Eugene,

every step ran smoothly now and I get a tsv in the end, now I am trying to do the complete run. I was also wondering about a vcf as output, is there any method to produce that?

eugenegardner commented 3 years ago

Closing this issue since the original issue seems to have been fixed.

See #6 for new responses.