Different versions have different results

zhangyafeng1 commented 4 years ago

Hi, when I use version v1.3.1, the HLA alleles I get are HLA-A*01:01, HLA-A*66:01, HLA-A*26:01, HLA-B*51:01, HLA-B*51:02, HLA-B*78:01, HLA-C*14:02, HLA-C*03:04.

With the same data and version v1.4, the HLA alleles I get are HLA-A*01:01, HLA-A*25:01, HLA-B*51:01, HLA-B*78:01, HLA-C*14:06, HLA-C*03:11, HLA-C*03:04.

As you can see, there are some differences. Is this caused by the version update? I hope you can confirm it so that I can find out the reason. Thank you very much!
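For reference, the two lists can be compared as sets to see exactly which calls changed. This is a minimal Python sketch and not part of HLAminer: `compare_calls` is a hypothetical helper, and the two lists are typed in by hand from the comment above, in the standard `HLA-A*01:01` notation.

```python
# Allele calls copied by hand from the two runs reported above.
V1_3_1 = [
    "HLA-A*01:01", "HLA-A*66:01", "HLA-A*26:01", "HLA-B*51:01",
    "HLA-B*51:02", "HLA-B*78:01", "HLA-C*14:02", "HLA-C*03:04",
]
V1_4 = [
    "HLA-A*01:01", "HLA-A*25:01", "HLA-B*51:01", "HLA-B*78:01",
    "HLA-C*14:06", "HLA-C*03:11", "HLA-C*03:04",
]


def compare_calls(old, new):
    """Split two allele lists into shared, old-only and new-only calls."""
    old_set, new_set = set(old), set(new)
    return {
        "shared": sorted(old_set & new_set),
        "only_old": sorted(old_set - new_set),
        "only_new": sorted(new_set - old_set),
    }


if __name__ == "__main__":
    for label, alleles in compare_calls(V1_3_1, V1_4).items():
        print(f"{label}: {', '.join(alleles)}")
```

For these two lists, four calls are shared (A*01:01, B*51:01, B*78:01, C*03:04) and the remaining calls differ between the versions.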

warrenlr commented 4 years ago

Thank you for your message and interest in HLAminer.

HLAminer is sensitive to the content of the HLA sequence databases. With more HLA sequences in the database, the predictions MAY* change, which is what you see for some alleles. This is because new alleles are being discovered, which may be closer in sequence to your reads/assembly contigs (see table below).

*That said, I confirm that the predictions provided in "test-demo" are the same for the "HPTASRrnaseq_classI.sh" and "HPRArnaseq_classI.sh" pipelines between v1.3.1 and v1.4.

This is the difference in size, in number of sequences, for the various databases provided in the database folder:

Database	v1.3.1	v1.4
HLA-I_II_CDS.fasta	14,183	21,206
HLA-I_II_GEN.fasta	709	4,412
HLA_ABC_CDS.fasta	9,308	13,478
HLA_ABC_EX23.fasta	4,602	4,602
HLA_ABC_GEN.fasta	531	3,618
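Counts like those in the table can be checked locally by counting FASTA header lines. The following is a minimal Python sketch, assuming the database files are plain FASTA in which every record starts with `>`; `count_fasta_records` and `compare_database_sizes` are hypothetical helpers, not part of HLAminer.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count FASTA records by counting the lines that start with '>'."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.startswith(">"))


def compare_database_sizes(old_dir, new_dir, names):
    """Return {name: (old_count, new_count)} for files found in both folders."""
    sizes = {}
    for name in names:
        old_path = Path(old_dir) / name
        new_path = Path(new_dir) / name
        if old_path.is_file() and new_path.is_file():
            sizes[name] = (
                count_fasta_records(old_path),
                count_fasta_records(new_path),
            )
    return sizes
```

Calling `compare_database_sizes` with the two database folders and a list of file names such as `["HLA_ABC_CDS.fasta", "HLA_ABC_GEN.fasta"]` should show whether a local copy matches the release it is supposed to correspond to.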

I also often see less variability in the predictions provided by HPTASR pipelines, since these tend to be more sensitive (in mapping sequence contigs instead of read pairs). The pipeline takes longer to run, but it is more robust in my opinion.

The last update was in October 2018 so you may want to update the databases in v1.4 by running: updateAll.sh from the database folder. This will fetch the new HLA sequences from public repos and rebuild the blast and bwa databases/index.

One thing you could do is replace the v1.3.1 database folder with that of v1.4 and re-run. Results should be the same or very close, since no changes were made to the hlaminer logic between these versions; only support for long reads was added and . Of course, it is difficult to know for sure without seeing a log, knowing which pipeline you used (HPTASR or HPRA), the type of reads, the confidence score from hlaminer, etc.

zhangyafeng1 commented 4 years ago

Thank you for your answer! I have a better understanding of the software now.

I used HPTASRwgs_classI.sh; the file in patient.fof is a BAM file; all parameters are the defaults.

The software generates some contigs while it runs. When I run it twice (same data, same version v1.3.1), the HLA results also differ a little. Is it related to the difference between contigs assembled for each run?

warrenlr commented 4 years ago

Hmm, that is strange. It's not a stochastic process, so you should not get different contig sets, and the same applies to blast.

> is it related to the difference between contigs assembled for each run?

Do a "diff" on the TASR contigs after each run to see if they are different. They should not be.
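A plain `diff` also flags contigs that are merely written in a different order or under different headers. As a complement, here is a minimal Python sketch that compares two contig files as multisets of sequences. It assumes the TASR contig file is FASTA-formatted, which is an assumption here; `read_fasta_sequences` and `compare_contig_sets` are hypothetical helpers, not part of TASR or HLAminer.

```python
from collections import Counter


def read_fasta_sequences(path):
    """Return the sequences of a FASTA file as upper-case strings."""
    sequences, chunks = [], []
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if it had sequence.
                if chunks:
                    sequences.append("".join(chunks).upper())
                chunks = []
            elif line:
                chunks.append(line)
    if chunks:
        sequences.append("".join(chunks).upper())
    return sequences


def compare_contig_sets(path_a, path_b):
    """Compare two contig files, ignoring record order and header text."""
    a = Counter(read_fasta_sequences(path_a))
    b = Counter(read_fasta_sequences(path_b))
    only_a, only_b = a - b, b - a
    return {
        "identical": not only_a and not only_b,
        "only_in_a": sum(only_a.values()),
        "only_in_b": sum(only_b.values()),
    }
```

If `identical` is true while `diff` still reports changes, the two runs differ only in ordering or naming. The sketch does not treat a contig and its reverse complement as equal.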

kvaldez commented 1 year ago

Hi! Has the above issue been resolved, or is this perhaps normal behaviour for HLAminer? I'm running version 1.3.1 and also getting different results each time.

I used HPTASRwgs_classI.sh; the files in patient.fof are FASTQ R1 and R2 files; all parameters are the defaults.

When I do a diff on the TASR contigs after each run, they are also different.

Thanks, Kristin

warrenlr commented 1 year ago

Hi Kristin, thank you for your report.

I just looked into it using v1.4 (most recent release) and with the test data provided.

I ran "./HPTASRrnaseq_classI.sh" in triplicate, from the test-demo folder.

I put the results here:

https://www.bcgsc.ca/downloads/btl/hlaminer/deterministicInvestigation23FEB2023.tar.gz

The results are in run1, run2, run3.

As far as I can tell, the TASR contigs are the same after each run, using the test RNA-seq data.

The predictions and prediction scores are also identical. HLA alleles predicted with the same score are not output in the same order, but they are not different HLA predictions across all 3 runs.
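Because alleles with tied scores can be written in a different order, a line-by-line `diff` of the result files may report differences that are not real. A minimal Python sketch of an order-insensitive comparison follows. It treats every non-empty line as an opaque record, which assumes that an allele and its score sit on the same line of the output file; `read_records` and `diff_ignoring_order` are hypothetical helpers, not part of HLAminer.

```python
from collections import Counter


def read_records(path):
    """Return the non-empty, whitespace-trimmed lines of a file as a multiset."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return Counter(line.strip() for line in handle if line.strip())


def diff_ignoring_order(path_a, path_b):
    """Return (lines only in A, lines only in B), ignoring line order."""
    a, b = read_records(path_a), read_records(path_b)
    return sorted((a - b).elements()), sorted((b - a).elements())
```

Two empty lists mean the files hold the same records. Any line that is returned points to a changed allele or score. Note that this check would also hide a line that moved between prediction groups, so it is a first filter rather than a full comparison.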

It appears that, at least for HPTASRrnaseq_classI.sh and with the test data provided (and for v1.4), the process is deterministic. I can't see how using the WGS pipeline/read data would be non-deterministic with v1.3.1 (unless the HLA database was updated between different runs, as I mentioned above).

kvaldez commented 1 year ago

Hi, thank you for the quick response. I'll do some test runs with HPTASRrnaseq_classI.sh (v1.4) and post an update next week.

Kristin

kvaldez commented 1 year ago

Hi again, I gave v1.4 a try and ran HPTASRrnaseq_classI.sh three times from the test-demo folder. However, I got different results each time, as well as different contigs.

I included the csv files here. Let me know if there's anything I should be doing differently; I only altered the original file to add #!/bin/bash at the top and perl before the perl commands. I'm also happy to share the intermediate files if that helps.

HLAminer_HPTASR_run3.csv HLAminer_HPTASR_run2.csv HLAminer_HPTASR_run1.csv

Thanks again, Kristin

warrenlr commented 1 year ago

I can't wrap my head around this, especially since it is deterministic on my end. The different contigs are what is most puzzling in your case, and likely the source of your different prediction results. It's almost as if files from the first run influence each subsequent run. The reason I say that is because your results in HLAminer_HPTASR_run1.csv are exactly the same as the expected predictions in HLAminer_HPTASR_test.csv, found in the distribution.

My only suggestion would be to delete all intermediate files between runs and troubleshoot by looking at which files change. If your contig sequences are different after each run, perhaps run TASR in isolation from the shell script? Have rd1.fq and rd2.fq somehow changed after each run? How about ../database/HLA_ABC_CDS.fasta? If the TASR assembly run parameters are the same at each run, only the aforementioned input files would influence the output of TASR.
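One way to look at file changes is to checksum every file in the working folder before and after a run and compare the two snapshots. The following is a minimal Python sketch under that assumption; `sha256_of`, `snapshot` and `diff_snapshots` are hypothetical helpers, not part of HLAminer or TASR.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(directory):
    """Map each file under `directory` (relative path) to its SHA-256 digest."""
    root = Path(directory)
    return {
        str(path.relative_to(root)): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_snapshots(before, after):
    """List the files that were added, removed or changed between snapshots."""
    common = before.keys() & after.keys()
    return {
        "added": sorted(set(after) - set(before)),
        "removed": sorted(set(before) - set(after)),
        "changed": sorted(name for name in common if before[name] != after[name]),
    }
```

Taking a snapshot of the run folder before and after each run shows which intermediate files a run leaves behind ("added") and whether inputs such as rd1.fq or rd2.fq were modified ("changed"). A file outside the folder, such as ../database/HLA_ABC_CDS.fasta, can be checked separately with `sha256_of`.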

kvaldez commented 1 year ago

I actually moved the intermediate files into separate directories before rerunning. The predictions from HLAminer_HPTASR_run1.csv are the same as in HLAminer_HPTASR_test.csv; however, the scores are different, so I think this may be a coincidence.

None of the reference files or fq files changed, but I can try running TASR in isolation to see what happens.

kvaldez commented 1 year ago

Sure enough, my TASR contigs are very different. I moved the intermediate files before each rerun, and the time stamps on all data and resource files are unchanged.

I took the command directly from HPTASRrnaseq_classI.sh and attached the contig files here.

TASRhla_run3.contigs.txt TASRhla_run2.contigs.txt TASRhla_run1.contigs.txt

warrenlr commented 1 year ago

What if you clone the repo fresh, make 3 copies of the test-demo folder, and re-run in each individual directory?

kvaldez commented 1 year ago

In this case, the resulting predictions are the same but the scores are slightly different:

HLAminer_HPTASR_run3.csv HLAminer_HPTASR_run2.csv HLAminer_HPTASR_run1.csv

warrenlr commented 1 year ago

Interesting.

So, on your end, re-running TASR in the same directory produces a different output, but running it in different directories yields a consistent contig output. This warrants further tests: 1) on the TASR side of things, when re-running multiple times in the same directory (though I can't see what may interfere with the process), and 2) on the scores, since I am not sure why they vary; they should not.

I re-tested everything on my end, on a different server than last week and I still do not see the non-deterministic behaviour that you observe. When running multiple times in the same directory, contig sequences are the same and predictions/prediction scores are also the same; the prediction order may vary slightly for HLA alleles having the same prediction score.

This is a puzzle, @kvaldez

kvaldez commented 1 year ago

Agreed it is confusing, thanks for your patience and suggestions!

bcgsc / HLAminer

Different versions have different results #8