JamieHeather / stitchr

Stitchr - a Python script to stitch together coding TCR nucleotide sequences from V, J, and CDR3 info
https://jamieheather.github.io/stitchr/
MIT License

An issue about stitching immunoglobulins #43

Closed DawnChou closed 4 months ago

DawnChou commented 4 months ago

Hi, I am trying to apply stitchr to BCRs, but I've run into some problems.

Following the instructions, I tried using this modified version of stitchrdl to download formatted human immunoglobulin sequences from IMGT. However, it didn't create IGH.fasta, IGK.fasta or IGL.fasta under the 'HUMAN-IG' data folder, due to the following warnings:

[Screenshot of the warnings, not preserved]

I don't know why this occurred, so I changed line 304 of this modified version of stitchrdl from if len(region_counts[loc]) >= 4: to if len(region_counts[loc]) >= 2: (I am not sure whether this is correct; I just wanted the code to create IGH.fasta, IGK.fasta and IGL.fasta under the 'HUMAN-IG' data folder). The modification did create those three files. However, when I run the example stitchr -v IGHV3-30-3*01 -j IGHJ4*02 -cdr3 CARLSPAGGFFDYW -c IGHM*01 -s HUMAN-IG -n JQ304252, it raises an error: Exception: No entries for CONSTANT in IMGT data. So at the moment I can't run stitchr on BCRs successfully. Is there any way to fix this? Thanks.
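A rough sketch of why relaxing that threshold may not be enough on its own. It assumes the check counts the distinct region types found per locus, and that a stitchable locus needs four of them; the names LEADER, VARIABLE and JOINING, the helper missing_regions and the counts are all hypothetical, and only CONSTANT comes from the error message above. None of this is taken from the actual gist.

```python
# Assumed region types; only CONSTANT is confirmed by the error message.
REQUIRED_REGIONS = ("LEADER", "VARIABLE", "JOINING", "CONSTANT")


def missing_regions(region_counts, locus):
    """Return the required region types with no sequences for this locus."""
    found = region_counts.get(locus, {})
    return [r for r in REQUIRED_REGIONS if found.get(r, 0) == 0]


# Made-up counts mimicking a download where no constant regions were retrieved
example_counts = {"IGH": {"LEADER": 120, "VARIABLE": 120, "JOINING": 13}}

# A relaxed ">= 2" check passes, so a FASTA file would be written...
print(len(example_counts["IGH"]) >= 2)
# ...but the locus still lacks a region that stitching needs.
print(missing_regions(example_counts, "IGH"))
```

Under these assumptions the relaxed check lets the file be written while the missing region only surfaces later, at stitching time.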

JamieHeather commented 4 months ago

Hello hello,

So unfortunately this is one of those consequences of my being a TCR person who doesn't really do much BCR stuff, so little problems don't get caught early!

In this case the cause was that IG constant region sequences are scraped in a separate command, and the URL I copied for the API from their website only had 'http', not 'https'. I guess that since I first wrote and tested that script, IMGT tweaked their settings so that http scrapes are rejected. As a result the script wasn't catching any constant regions, and so none of the loci were stitchable.

I've fixed the gist now (v0.2.0), and confirmed that it works at least on my setup. Please reopen and let me know if that doesn't get it working for you.
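For anyone hitting a similar silent failure in their own scraping scripts, here is a minimal sketch of two guards: upgrading an http URL to https, and counting the FASTA records that came back so an empty scrape gets noticed. The function names and the example.org URL are placeholders for illustration, not the real IMGT endpoint or the code in the gist.

```python
from urllib.parse import urlsplit, urlunsplit


def force_https(url):
    """Return the URL with an 'http' scheme swapped for 'https'."""
    parts = urlsplit(url)
    if parts.scheme == "http":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


def count_fasta_records(text):
    """Count FASTA header lines (those starting with '>') in a response body."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


# Placeholder URL, purely illustrative
print(force_https("http://example.org/genes?region=constant"))

# An empty response body yields zero records, which a script could treat as an error
print(count_fasta_records(""))
```

Checking the record count right after each scrape would flag a rejected request at download time, before any stitching is attempted.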

Also, all those "Skipping gene sequence" warnings are unrelated - that's just stitchr avoiding alleles with ambiguous bases. As mentioned in the user guide, everything is trickier in BCRs than TCRs, so the best thing to do if you can is generate donor-specific germline references (potentially also with hypermutated entries) as a starting point, rather than using what's in IMGT, unless you want to make naive/IGM sequences.

(Incidentally, if you haven't already seen them, you may be interested in the AIRR-C's recently released germline IG sets for humans and mice, available from OGRDB.)
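To illustrate the kind of filtering behind those warnings, here is a small sketch that keeps only alleles made up entirely of A, C, G and T. It is not stitchr's actual implementation; the function names, the allele names and the toy sequences are all made up for the example.

```python
UNAMBIGUOUS = set("ACGT")


def is_unambiguous(seq):
    """True if the sequence is non-empty and contains only A, C, G or T."""
    return bool(seq) and set(seq.upper()) <= UNAMBIGUOUS


def split_alleles(alleles):
    """Split a {name: sequence} dict into kept entries and skipped names."""
    kept, skipped = {}, []
    for name, seq in alleles.items():
        if is_unambiguous(seq):
            kept[name] = seq
        else:
            skipped.append(name)
    return kept, skipped


# Toy data: the second entry contains an ambiguous base (N)
toy = {"geneA*01": "ATGGCCTGA", "geneA*02": "ATGGNCTGA"}
kept, skipped = split_alleles(toy)
for name in skipped:
    print(f"Skipping gene sequence {name}: ambiguous bases")
```

A filter like this only drops individual alleles, so it wouldn't by itself stop a whole locus file from being written.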

Cheers, Jamie