RabadanLab / arcasHLA

Fast and accurate in silico inference of HLA genotypes from RNA-seq
GNU General Public License v3.0

Cannot reproduce test case #51

Closed lkuchenb closed 3 years ago

lkuchenb commented 3 years ago

I cannot reproduce the test case as described in README.md. Here's how I ran arcasHLA:

# Install requirements
conda create -n arcas-hla-deps coreutils 'bedtools>=2.27.1' biopython git-lfs 'kallisto>=0.44.0' numpy pandas pigz 'python>=3.6.1' 'samtools>=1.9' scipy
conda activate arcas-hla-deps

# Get latest release
curl -L https://github.com/RabadanLab/arcasHLA/archive/v0.2.0.tar.gz | tar zx

# Obtain reference version required for tests
./arcasHLA-0.2.0/arcasHLA reference --version 3.24.0

# Extract reads
./arcasHLA-0.2.0/arcasHLA extract arcasHLA-0.2.0/test/test.bam -o arcasHLA-0.2.0/test/output --paired -t 8 -v

# Genotyping
./arcasHLA-0.2.0/arcasHLA genotype arcasHLA-0.2.0/test/output/test.extracted.1.fq.gz arcasHLA-0.2.0/test/output/test.extracted.2.fq.gz -g A,B,C,DPB1,DQB1,DQA1,DRB1 -o arcasHLA-0.2.0/test/output -t 8 -v

This is the output I get:

{
        "A": ["A*03:01:01", "A*01:01:01"],
        "B": ["B*39:39:01", "B*07:02:01"],
        "C": ["C*01:02:01", "C*08:01:01"],
        "DPB1": ["DPB1*02:01:02", "DPB1*14:01:01"],
        "DQA1": ["DQA1*05:03:01", "DQA1*02:01:01"],
        "DQB1": ["DQB1*06:04:01", "DQB1*02:02:01"],
        "DRB1": ["DRB1*03:02:01", "DRB1*10:01:01"]
}

which does not match the expected output in the README for DQB1 and DRB1.

I also tested this on the master branch.
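A minimal sketch for pinpointing which loci differ between two genotype files, rather than comparing them by eye. It treats each locus as an unordered pair of alleles, so a swapped allele order does not count as a mismatch. The function names `mismatched_loci` and `compare_files` are hypothetical helpers, not part of arcasHLA, and the file paths in the usage note are assumptions.

```python
import json


def normalize(genotype):
    """Map each locus to a sorted tuple of alleles so allele order is ignored."""
    return {locus: tuple(sorted(alleles)) for locus, alleles in genotype.items()}


def mismatched_loci(observed, expected):
    """Return the sorted loci whose allele sets differ, or that are missing on one side."""
    obs, exp = normalize(observed), normalize(expected)
    return sorted(
        locus for locus in obs.keys() | exp.keys()
        if obs.get(locus) != exp.get(locus)
    )


def compare_files(observed_path, expected_path):
    """Load two genotype JSON files (hypothetical paths) and list the differing loci."""
    with open(observed_path) as f_obs, open(expected_path) as f_exp:
        return mismatched_loci(json.load(f_obs), json.load(f_exp))
```

For example, `compare_files("arcasHLA-0.2.0/test/output/test.genotype.json", "expected.json")` would list the differing loci, assuming the genotype step wrote its result to that path and that `expected.json` is a hand-made copy of the expected output from the README.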

wgmao commented 3 years ago

I get the exact same output, which also does not match the expected output.

tpereachamblee commented 3 years ago

Thank you for using our tool. We have confirmed that this was, in fact, an issue with our README, which has been updated to accurately reflect the output of arcasHLA genotype when running the test case with the latest commit (8096c18). The original output included in the README came from a very early private release of arcasHLA and was not updated for the subsequent public releases of the codebase, all of which should, at the time of writing, produce consistent genotyping results.

tpereachamblee commented 3 years ago

Thank you all for using our tool and pointing out this important issue. I was originally mistaken (see my previous comment, with the text that has been struck through) and will try to clear up any confusion here:

The genotyping calls from releases v0.1.0 and v0.2.0 are presently inaccurate (hence the failure to reproduce the proper output for the test case) due to an error resulting from a deprecated call to "git lfs", which has been addressed in the master branch by commit a7ce1bb.

Moreover, the most recent update to the README, commit eeefc2f, undoes the changes that I mentioned in my previous comment from commit 8096c18 (which, incidentally, had been misapplied to the output of the partial command rather than that of the genotype command).

We will release a new, tagged version of arcasHLA with updated requirements and various bug fixes soon.
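Regarding the "git lfs" error mentioned above: one way a failed Git LFS download can show up, offered here as an assumption rather than something confirmed in this thread, is a reference file left on disk as a small LFS pointer stub instead of the real data. Pointer files begin with the line `version https://git-lfs.github.com/spec/v1`, so a quick check is possible. The helper name `is_lfs_pointer` is hypothetical and not part of arcasHLA.

```python
# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file at `path` starts with the Git LFS pointer header."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX
```

Calling `is_lfs_pointer(...)` on the downloaded reference database file (wherever your installation stores it) and getting `True` would suggest that the LFS content was never fetched.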

lkuchenb commented 3 years ago

@IoanFilip2 Is the new release underway? It would be great to keep tracking this so that people using the software in their downstream pipelines (such as myself) know when to update the tool.

Thanks for providing this software!