cfe-lab / proviral

0 stars 0 forks source link

Re-write HIVSeqinR in python #13

Closed CBeelen closed 4 months ago

CBeelen commented 1 year ago

Now that we are automatically launching the proviral pipeline as part of MiCall, it would make sense to re-write HIVSeqinR in python, or at least re-structure the existing R code. This would make the code a lot more maintainable for us, and would allow us to diagnose bugs more easily. Since the launch of the pipeline, we have encountered 2 errors that led to the pipeline failing, see #12.

Alternatively, re-visit replacing HIVSeqinR by another tool, as outlined in #10.

shengquan-Tang commented 4 months ago

Sorry to disturb you. Has the python version been completed?Thank you.

donkirkby commented 4 months ago

We're switching to HIVIntact instead, @shengquan-Tang. It's written in Python, and we've contributed a couple of pull requests.

shengquan-Tang commented 4 months ago

We're switching to HIVIntact instead, @shengquan-Tang. It's written in Python, and we've contributed a couple of pull requests.

Get it. Thank you for your reply. By the way, HIVIntact can distinguish intact and defective sequcences, but can't distinguish, such as, "LargeDeletion, nonHIV". Is it right? Thanks again!

donkirkby commented 4 months ago

One of our pull requests adds the large deletion error, @shengquan-Tang. If you need that feature, you could use our fork of the project.

shengquan-Tang commented 4 months ago

I truly appreciate your assistance. Best regards.

Donaim commented 4 months ago

Hi @shengquan-Tang, also here is the repository of our HIVIntact fork: https://github.com/cfe-lab/HIVIntact

We have added almost all previous error codes (such as LongDeletion, NonHIV) into our version of HIVIntact, so that it is more compatible with HIVSeqinR. It has some new ones, too.

But there are still many differences between the two programs. If you are going to use it, take a look at our documentation page https://cfe-lab.github.io/HIVIntact/

shengquan-Tang commented 4 months ago

Hi, @Donaim. Your response is really significant and thrilling to me. I'll learn how to use it. I appreciate your assistance.

donkirkby commented 4 months ago

Closing this issue, because we're switching to HIVIntact.

Donaim commented 4 months ago

Superseeded by https://github.com/cfe-lab/proviral/issues/10

shengquan-Tang commented 2 months ago

Hi, @donkirkby. Today, I intended to utilize HIVIntact, however, I encountered difficulties in accessing it. It remains unclear whether your have altered the pathway or discontinued its availability due to other reasons. Thanks for your help! image

Donaim commented 2 months ago

Hi @shengquan-Tang. The URL you linked to, is to the older, original version of HIVIntact. Please use the following repository: https://github.com/cfe-lab/CFEIntact . We will base our proviral pipeline on it.

So the command should be git clone https://github.com/cfe-lab/CFEIntact.

Pathways have changed, sorry for inconveniences caused by that.

shengquan-Tang commented 2 months ago

Thank you @Donaim. Yes, I have also tried https://github.com/cfe-lab/CFEIntact, it can be successfully installed. But it will show some errors whether it's my data or your test data. I don't know what went wrong. image

image

image

Donaim commented 2 months ago

Hi @shengquan-Tang . Sorry for this error, it's due to a difference in python version. I have fixed it today.

Please clone the repository again, to get the newest version of CFEIntact (v1.18.2), and then install it again.

Thank you for letting us know about this issue!

shengquan-Tang commented 2 months ago

Thanks @Donaim, I will try again!

shengquan-Tang commented 2 months ago

Hi, @Donaim. I ran the command "cfeintact check --subtype B test.fasta" which includes the HXB2 (K03455.1) reference sequence. However, I noticed that CFEIntact was unable to recognize it as an intact sequence. The HXB2 sequence was categorized under "error.csv" and had issues like "FrameshiftInOrf" and "DeletionInOrf" (as shown in the first picture). Interestingly, despite having 9719 matches (as seen in the second picture), this outcome seems quite unreasonable. Moreover, when using CFEIntact with my sequencing data containing over ten thousand sequences, it failed to identify even a single intact sequence. In contrast, HIVSeqinR successfully identified over one thousand intact sequences with the same data. Importantly, if CFEIntact cannot recognize intact HIV sequences accurately, I am concerned about whether other functions, such as ORFs analysis and check mentioned on https://cfe-lab.github.io/CFEIntact/workflow.html, can be relied upon for correct identification of HIV sequences. I apologize for any inconvenience caused by these questions; however, your response is incredibly valuable to me and will assist us in identifying HIV sequencing data with greater ease and efficiency.

image

image

Donaim commented 2 months ago

Hi @shengquan-Tang !

Thank you for your feedback. I have already made improvements to CFEIntact based on this conversation, and will continue to do so. We hope you can benefit from CFEIntact as much as we do.

Addressing the issues you've brought up:

HXB2 intactness

The orignal HXB2 sequence is indeed defective. One of the defects was reported by CFEIntact, as shown on your first screenshot. That defect, and all the other ones, are documented here: https://www.hiv.lanl.gov/content/sequence/HIV/REVIEWS/HXB2.html All of them are either a SNP or indel mutations, which is why the number of matches (9719) is so high. Nevertheless, these defects are severe, and are likely to prevent the virus from replicating, so CFEIntact makes the right call here.

When CFEIntact performs it analysis, it relies on a modified version of HXB2 that does not contain any known defects. We intentionally edited the original sequence when we have discovered that they sometimes influence subsequent analysis.

No intact sequences detected

CFEIntact's default configuration is rather strict. All the checks that CFEIntact can perform, are performed by default, and if any of those check result in an error, it marks the input sequence defective. One particular check that can cause many false positives is the sequence divergence check. It's the only one that we disable for our proviral pipeline, and is also what we recommend doing in the documentation. Please see if disabling that option produces less surprising results.

Reliability

Let me present some points that made us trust CFEIntact.

We've done multiple rounds of validations as we transitioned from HIVSeqinR. One of the main validation steps was manual comparison, involving a human reviewer, of over 1000 subtype B sequences. In conclusion, CFEIntact compared overwhelmingly positively compared to HIVSeqinR. This means that CFEIntact was almost always giving the answer of our human reviewer, except for the cases where we agreed that the final decision is not obtainable with the available algorithms. We have not found a single case where HIVSeqinR has made an obviously better judgment than CFEIntact.

CFEIntact is reliable and stable. Its calls are backed by evidence-based studies. It follows software best practices, including static analysis, continuous integration, high test coverage, and is written in a safe and well-established programming language Python.

Still, if you find obviously wrong calls (like a discrepancy between the error code and the description in the documentation), do inform us and we'd be happy to investigate further.

More issues

As I have already mentioned, we benefit from user feedback. And we believe it can likewise benefit other CFEIntact users who may face similar issues in the future. To that end, could you please submit any future queries or issues directly to our CFEIntact's issues page? This way, other users can also easily find them.

shengquan-Tang commented 1 month ago

Thanks @Donaim . The answer you provided is extremely valuable, and I am reviewing the information you mentioned. I will submit queries or issues directly to CFEIntact's issues page if I have further puzzled.