Short read simulate issue

jiadong324 commented 4 years ago

Dear author,

When I run VISOR SHORtS, it raises an error of "bwa is not found..."

But I checked my conda virtual environment and also executed the code for "external tool check" with python activated through terminal. Those tools include bwa, samtools and wgsim are all found not None.

Please help and thanks!

Here I use which to list the path of all these tools:

davidebolo1993 commented 4 years ago

Hi @jiadong324, and thanks for getting in touch.

I can't replicate your problem but we can try to figure it out what's happening. For example, you can check whether your python environment is capable to find bwa. Just enter python (or ipython) and try:

from shutil import which
which('bwa')

Can you please tell me what is happening? Just in case you already did this, can you please copy and paste the complete log from VISOR_SHORtS.log?

Best,

Davide

jiadong324 commented 4 years ago

Hi,

This is what I get as you mentioned:

And here is the log file:

davidebolo1993 commented 4 years ago

Hi,

yes, probably it is something that has to do with the way anaconda links the installed dependencies. It can be, at least. I think it is worthwhile to install a fresh visorenv environment as described here (https://davidebolo1993.github.io/visordoc/installation/installation.html), as it seems to me that you are using an old one (py36). As you have already anaconda installed, you can start from: conda update -y conda conda create -y -n visorenv python=3.6 (Then, just follow the instructions in the page I linked).

Alternatively, you can try the docker version (https://davidebolo1993.github.io/visordoc/installation/installation.html#docker-container).

Let me know !

Best,

Davide

jiadong324 commented 4 years ago

Ok, I will try a new one and let you know if it works.

Thanks!

jiadong324 commented 4 years ago

It works now. I am wondering how to create homo or het SVs. For example, what should I do to create het SVs at ch20?

Thanks!

davidebolo1993 commented 4 years ago

Hi @jiadong324,

Happy that now it's working. Generally speaking, I can suggest to go through the supplementary materials of VISOR's manuscript. They include a detailed description of all your concernings (https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz719/5582674).

I will try to answer them here briefly:

Yes, that's one thing you can do. Otherwise, you can generate 2 different BED with the same alterations specified (this is particularly useful if you want to simulate HOM alterations as well as HET SNPs).
This is more a technical question. As you properly noticed, in order to calculate the correct number of reads to simulate (thus, getting the correct coverage for each haplotype), for each chromosome you want to simulate, you should specify its max dimension. Let's say that you want to simulate chr1 and this chromosome has length L in your reference, length L-100 in haplotype 1 and L+50 in haplotype 2. For this example, you should specify the last dimension in the BED for SHORtS. Indeed, if you do not specify the max dimension but the reference one or the one from haplotype 1 (L or L-100), you will lose a part of chr1 in haplotype 2. At the same time, specifying a dimension greater than the true one for reference/haplotype 1 is not causing issues, as VISOR re-sizes to the max available dimension if it exceeds chromosome boundaries (see also https://github.com/davidebolo1993/VISOR/blob/master/VISOR/SHORtS/SHORtS.py, lines 557-573 for more details). For your specific case, you should use the max dimension between chr1 dimension in reference and chr1 dimension in the unique haplotype you have.
Coverage fluctuations are more likely to be present in whole-exome sequencing data and reflect the so called capture biases. For whole-genome sequencing data, I assume you can ignore this (that is, putting this to 100.0 should be fine). The purity paramenter specifies the percentage of reads that will be simulated from your (unique, if I understood correctly) haplotype for that region/chromosome. Let's say that you simulated a deletion on your chr1. If you specify a purity value of 50.0, half of the reads will be simulated from haplotype 1 and will contain the deletion; the remaining half will be simulated from the reference and won't contain the deletion. This reflects the classical normal in tumor contamination that you see in many cancers. As suggested before, have a look at the Supplementary Material of our mansucript for more details.

I hope it helped.

davidebolo1993 commented 4 years ago

It works now. I am wondering how to create homo or het SVs. For example, what should I do to create het SVs at ch20?

Thanks!

Sorry but I read this too late and already answered to your previous comment. However, if you want to create an HET SV (a deletion, for example) on chr20 you should create 2 haplotypes. You can proceed by generating 2 BED files (one for haplotype 1, containing the deletion, and one for haplotype 2, containing a single SNP) and give these to HACk. This will generate 2 haplotypes (one with the SV) in the output folder. All the procedures are also explained step-by-step here:

https://davidebolo1993.github.io/visordoc/usecases/usecases.html#visor-hack

Best,

Davide

jiadong324 commented 4 years ago

@davidebolo1993 Thanks for your immediate response, really appreciated!

I am working on complex SV callers both for short and long reads, so your simulator really attracts my attention. It dose include some important complex types, but can you also add another common type, which is deletion associated with inversion at flanking regions?

Since I am benchmarking my algorithm, so I may modify some of your code to meet my demands.

Thanks!

davidebolo1993 commented 4 years ago

@jiadong324, no worries. I'm here to help.

You are free to modify the code as you want. I can review it at the end, of course. For the time being, I can suggest to simply generate small inversion flanking the deletion at the HACk stage.

For example (first 4 columns of a HACk.bed file):

chr20 20000000 20001000 inversion ... chr20 20001001 21000000 deletion ... chr20 21000001 210001000 inversion ...

As you may see by using VISOR, it is quite versatile and allows you to hack on simple stuff to get more complex SVs. Be sure entries do not overlap, otherwise they will be skipped.

I'm closing the issue for the time being. Feel free to get in touch with me at any time.

Best,

Davide

davidebolo1993 / VISOR

Short read simulate issue #2