ShoRAH: Illumina reads ; influenza

Dear,

I am very interest in using your tool for my project. I'm working on influenza reads that were sequenced with Illumina MiSeq. I was able to extract quasispecies with LoFreq, however LoFreq doesn't have the ability to construct haplotypes, therefore I turned to this tool. However, I tried running this, but I encounter always an error:

/usr/local/bin/b2w: error while loading shared libraries: libhts.so.2: cannot open shared object file: No such file or directory

I also tried this with the amplicon example that was given, but same error was given. I don't know if you have an idea what could be the cause of this?

Additionally, I also was wondering what parameters would be ideal as I don't find a lot of information. Based on a few papers I think that for my purpose following command should work: shorah shotgun -b sample_sort.bam -f sample.fasta -a 0.01 -w 120 -s 3

However, I don't know if the other parameters should be set or if the default is sufficient?

Kind regards and thanks in advance Laura

Hello Laura,

However, I tried running this, but I encounter always an error:

/usr/local/bin/b2w: error while loading shared libraries: libhts.so.2: cannot open shared object file: No such file or directory

This error message means that when you try running ShoRAH, it can't manage to find one of its dependencies: the library libHTS (from samtools) against which it was compiled. Did you compile libHTS yourself? Is it also installed in /usr/local/lib? Did you run ldconfig to make your system aware of this new library? If you haven't installed libHTS in a standard location, you might look at either LD_LIBRARY_PATH to specify non-standard locations while running or LD_RUN_PATH/-Wl,-rpath to hard-code non-standard locations when compiling.

Alternatively, I would strongly suggest that you consider installing the bioconda package as conda will automatically handle all dependencies for you and make whole installation experience much simpler. (e.g.: for our V-pipe pipeline, in the tutorials we recommand letting snakemake's built-in bioconda support to handle the package installation).

Additionally, I also was wondering what parameters would be ideal as I don't find a lot of information. Based on a few papers I think that for my purpose following command should work: shorah shotgun -b sample_sort.bam -f sample.fasta -a 0.01 -w 120 -s 3

An example of a current run of ShoRAH in our V-pipe pipeline:

/cluster/work/bewi/pangolin/shorah-test/test/bin/shorah shotgun -t 64 -a 0.1 -w 201 -x 100000 -c 0 -r NC_045512.2:1-29903 -R 42 -b REF_aln.bam -f cohort_consensus.fasta

Explanation of options:

-a we're currently using an alpha of 0.1 (which is ShoRAH's default) to control the probability of creating new classes in the dirichlet sampling process. The value of 0.01 that you propose might be helping reduce the false positivies if your sample exhibits low diversity.
- -w you must select a width for the window which is divisible by 3 and is shorter that the read length. In our current experiments, we use 250-long reads, so a window of 201 make sure that enough reads will cover a reasonable fraction of the windows (-m, by default 95% fraction of the window). If your experiment uses reads of length 150, using -w 120 seems reasonable (and is conveniently divisible by 3).
- -s 3 (we omit it as everything relies on the default 3 overlapping windows in our pipeline).
- -x 100000 we don't really limit the maximum coverage (we can afford the extra processing time).
- -c 0 we currently do not limit the minimum coverage, but beware that benchmarking with simulated synthetic data has show lower quality result in windows with less than 50 coverage. You might consider either filtering at a later point (our current approach) or raising the minimum coverage.
- -r can be used to specify a region. More on this later.
- -R can force a random seed if you want more reproducible results.
- -t the upcoming 1.99.1 version will allow limiting the concurrent threads. Consider subscribing to our mailing list to be informed about upcoming releases of both our pipeline (V-pipe) and the component we develop such as the present ShoRAH.

however LoFreq doesn't have the ability to construct haplotypes, therefore I turned to this tool.

Please note that ShoRAH 2 has dropped the support for full global haplotype reconstruction (you would need to use ShoRAH 1 for that). It only reconstructs local haplotype within the windows. The .fasta file of this local haplotypes are available in the sub-directory /support. Example from our current research:

# zcat support/w-NC_045512.2-1274-1474.reads-support.fas.gz
>hap_0|posterior=1 ave_reads=404.707
GTTAAAGCCACTTGCGAATTTTGTGGCACTGAGAATTTGACTAAAGAAGGTGCCACTACTTGTGGTTACTTACCCCAAAATGCTGTTGTTAAAATTTATTGTCCAGC {...}
>hap_1|posterior=0.0254284 ave_reads=0.0254284
GTTAAAGCCACTTGCGAATTTTGTGGCACTGAGAATTTGACTACAGAAGGTGCCACTCCTTGTGGTTACTTACCCCAAAATGCTGTTGTTAAAATTTATTGTCCAGC {...}

For each reconstructed local haplotype you get:

an arbitrary hap_ identifier (the numbers are arbitrary and in rough order of posterior. They do not match across windows)
the posterior probability of this haplotype after sampling. I would suggest filtering and keeping only high probability posteriors (currently, when calling SNVs, ShoRAH only keeps haplotypes with posteriors > .9)
ave_reads the number of reads which contributed to this haplotype during the considered iterations. (you could also consider filtering that)

In the above example, I would definitely trust hap_0 (has a high probability, and is based on a large amount of reads), but not hap_1 (low posterior probability, few reads contributing to it).

You could also have a look at the discussion in issue #66

Last important note: As it turns out, until now we have mostly worked with viruses which have a single continuous genome: HIV, HCV and SARS-CoV-2. I haven't tested how the code reacts in the case of a virus with a segmented genome as the orthomyxoviridae such as Influenza. I would be very interested if you could report any error, so I can try to fix segmented viruses support in b2w (or elsewhere inside ShoRAH 2). In case of errors, a possible circumvention would be to force the processing of specific regions using the -r parameter.

Please keep me informed of your successes.

Thank you for the advice and the help! It seems to be working on my influenza samples so very nice! I installed shorah version 1.99.0 (with bioconda) so this is Shorah 1 right? However, I do not find the global haplotype or should this be another command?

I putthe results from the Shorah for one sample here: https://we.tl/t-1KJDYbFzHA

Thanks !

1.99.0 is our previous pre-release for ShoRAH2

BTW: Yesterday we released ShoRAH2 v1.99.1 which fixes a few minor bugs (posterior values '>1.0', unhandled exceptions from exp underflows), in addition to introducing the above mentionned -t option. I would suggest updating as soon as the bioconda package becomes available in your local mirror.

It doesn't reconstruct the global haplotypes. Only the local haplotypes.

Looking at your files (thank you for providing them), it seems that ShoRAH has crashed for some reasons: The shorah.log files stops abruptly after having processed half of the windows. The local haplotype are the files like w-4-4-WGS-H3N2-80-20-Mix1_MP-1-120.reads-support.fas (but because of the crash, ShoRAH didn't get to the point where it compresses and moves them into the support/ subdirectory yet):

>hap_0|posterior=1.0011 ave_reads=8165.08
GATATTGAAAGATGAGCCTTCTAACCGAGGTCGAAACGTATGTTCTCTCTATCGTTCCATCAGGCCCCCTCAAAGCCGAGATCGCGCAGAGACTTGAAGATGTCTTTGCTGGGAAAAACA
>hap_1|posterior=0.989646 ave_reads=1.99331
GATATTGAAAGATGAGCCTTCTAACCGAGGTCGAAACGTATGTTCTCTCTATCGTTCCATCAGGCCCCCTCAAAGCCGAGATCGCGCAGAGACTTGAAGATGTCTTTGCTGGGAA-----

hap_0 has a high posterior probability (the post >1.0 bug is fixed in ShoRAH2 v1.99.1) and is based by a high average number of reads (8165).
hap_1 and the following are based on a much lower number of average reads (all < 5), I wouldn't trust them.

(see: doi:10.1101/2020.06.09.142919 where we have benchmarked the output of ShoRAH as part of our V-pipe. Section 3.3 of the paper seem to suggest that 10 is a good threshold for local haplotype's average reads).

I would like to investigate the crash - as mentioned in my previous message, this is the first occasion I have to test a segmented virus genome. Would you accept sending me also your reference FASTA and the aligned BAM ?

v1.1.3 seems to be the last ShoRAH1 that still has the mm.py command used for global haplotypes. But given that even the upgraded ShoRAH2 v1.99.0 crashes on your data, I wouldn't hold hopes high that the older ShoRAH1 v1.1.3 would successfully build haplotypes without crashing.

What is the exact question you're trying to answer with ShoRAH?

Do you need exact sequence of each chromosome of each present quasi-specie?
Do you need to asses quasi-specie diversity ?

If it is the later, analysing the local haplotype per windows could give a very good estimate of the diversity present in the sample.

If it is the former, you might also want to explore other options. Our group has also developed Haploclique and QuasiRecomb as other tools which might help you in this direction.

Haploclique is one of the engine for global Haplotype featured in V-pipe. The other being SAVAGE by J. Baaijens et al.

I'm sorry, I thought the run was finished, but this was not the case. I put the actual results again in this link ([https://we.tl/t-t1stc9vhBp]) and I included the bam file and consensus fasta. But I think it has not stopped, I was a bit too quick. This sample is a mix of two reverse genetic viruses, one wild type and one with a mutation in the NA segment (antiviral resistance mutation E119V) that are present in a 80-20 ratio. So this is a bit our test sample as we know what should be in here (there could be some extra mutations due to growing in the cells but this should be limited).

With LoFreq we were able to extract these mutations quite nicely with the "correct" ratios. Hereafter we used the same steps on "real" clinical H3N2 samples, so we already have a quite good idea of the diversity in the samples. In some samples we have a lot of different low frequency variants. So I would like to construct the sequences for the different quasispecies that are present in the sample so that a phylogenetic analysis can be run, to see if there are patients that are actually infected by two different strains and how different the low frequency quasispecies are from the consensus population. So therefore I think we need the global haplotype and not (only) the local haplotype.

I will check then this last version. I'm also exploring other tools like Haploclique and QuasiRecomb, but here it is a bit the same "problem". There are not a lot of papers where they have been used (on influenza viruses) and often there is only limited information on how too choose appropriate parameters. So this makes it quite challenging to select the parameters so currently I'm running them with the default parameters. However, I would suspect that the results could be "better" if the parameters would be adapted.

I have managed to run successfully shorah until the end (in shorah.log : shotgun run ends). You should have recieved an owncloud shared link per mail (if not, here's an auto expiring one ). In addition to the usual files, I've quickly hacked up a graph (see file: report.html inside the zip file, produced by make_local_report.pl).

4-4-WGS-H3N2-80-20-Mix1_NA_report (It works best by selecting one of the discrete colours from the palette drop down)

I have filtered for posterior>.9 and _avereads>10. (Sorry the filter pop-up on the top right isn't interactive).

In the windows 202-402, 269-469 and 336-536, we clearly see a co-occurrence of a haplotypes. Same in windows 269-469, 336-536 and 403-603, we see another co-occurrence at a different proportion mix. (Note: the colours doesn't map between adjacent windows. It's just attributed to differnt hap_nn numbers)

You can check the sequences themselves, e.g. in support/w-4-4-WGS-H3N2-80-20-Mix1_NA-269-469.reads-support.fas.gz:

hap_0|posterior=1 ave_reads=528.745
CGCAATGTGACATTACAGGATTTGCACCTTTTTCTAAGGACAATTCGATTAGGCTTTCCGCTGGTGGGGACATCTGGGTGACAAGAGAACCTTATGTGTCATGCGATCCTGACAAGTGTTATCAATTTGCCCTTGGACAGGGAACAACACTAAACAACGTGCATTCAAATAACAATGTACGTGATAGGACCCCTTATCGGA
>hap_1|posterior=1 ave_reads=4351.43
CGCAATGTGACATTACAGGATTTGCACCTTTTTCTAAGGACAATTCGATTAGGCTTTCCGCTGGTGGGGACATCTGGGTGACAAGAGAACCTTATGTGTCATGCGATCCTGACAAGTGTTATCAATTTGCCCTTGGACAGGGAACAACACTAAACAACGTGCATTCAAATAACACAGTACGTGATAGGACCCCTTATCGGA
>hap_2|posterior=1 ave_reads=2288.23
CGCAATGTGACATTACAGGATTTGCACCTTTTTCTAAGGACAATTCGATTAGGCTTTCCGCTGGTGGGGACATCTGGGTGACAAGAGTACCTTATGTGTCATGCGATCCTGACAAGTGTTATCAATTTGCCCTTGGACAGGGAACAACACTAAACAACGTGCATTCAAATAACACAGTACGTGATAGGACCCCTTATCGGA
>hap_3|posterior=1 ave_reads=46.2196
CGCAATGTGACATTACAGGATTTGCACCTTTTTCTAAGGACAATTCGATTAGGCTTTCCGCTGGTGGGGATATCTGGGTGACAAGAGAACCTTATGTGTCATGCGATCCTGACAAGTGTTATCAATTTGCCCTTGGACAGGGAACAACACTAAACAACGTGCATTCAAATAACACAGTACGTGATAGGACCCCTTATCGGA
>hap_4|posterior=1 ave_reads=27.2472

(Note: the hap_nn numbers doesn't map between adjacent windows. Each window is treated independently and numbers are assigned in order of posterior)

You can also check the SNV themselves in the CSV or VCF files (snv/SNVs_0.010000_final.vcf). Here are the variants in the CSV file that were observed in all three overlapping windows mentioned above :

4-4-WGS-H3N2-80-20-Mix1_NA,356,A,T,0.3363,0.3160,0.3675,1.0000,1.0000,1.0000,3083,2474,8607,7265,0.809262,1
4-4-WGS-H3N2-80-20-Mix1_NA,443,C,A,0.0730,0.0706,0.0702,1.0000,1.0000,1.0000,524,504,8971,7377,0.458405,1
4-4-WGS-H3N2-80-20-Mix1_NA,444,A,T,0.0730,0.0706,0.0702,1.0000,1.0000,1.0000,522,505,8859,7126,0.382113,1

This first one is probably the SNV that cause the E119V amino acid substitution you have referred to, and that we see in windows 202-402, 269-469 and 336-536.

We also have developed a visualization for SNV, but we need to adapt it a bit for your use case, we have developed it for SARS-CoV-2 and obviously it doesn't support segmented genomes (yet). So I'll need to hack it a bit before providing you a report there too.

(note: the Pval and Qval in the last column refer to the strand bias test, see the INFO field in the VCF file. Normally, it is expected to fail (=there is no bias) in normal paired end sequencing. Details in doi:10.1186/1471-2164-14-501).

cbg-ethz / shorah

ShoRAH: Illumina reads ; influenza #75