airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Coordinate with IgBlast to support AIRR germline set format #599

Closed schristley closed 1 year ago

schristley commented 2 years ago

I believe @jianye00 has mentioned in the past that IgBlast would be willing to consider using an AIRR format for germline sets as an input instead of its internal tables. Now that we have an experimental format, it would be worthwhile to start the discussion to see what IgBlast's requirements are, and whether there are any gaps with what's provided in the AIRR format.

This could provide immediate benefits to users, as many non-model organisms could potentially be annotated using IgBlast, and it would open up downstream analysis tools to that data.

jianye00 commented 2 years ago

Hi Scott,

Are you talking about the sub-region boundaries like FWR1, CDR1 for germline V and J genes? We essentially need the start and stop positions. Yes this would complement the missing organisms for IgBlast and possibly holes even in some supported organisms.

schristley commented 2 years ago

> Are you talking about the sub-region boundaries like FWR1, CDR1 for germline V and J genes? We essentially need the start and stop positions. Yes this would complement the missing organisms for IgBlast and possibly holes even in some supported organisms.

Yes, those should be part of the annotation, like these in SequenceDelineationV. I'm not sure what's available for J genes, but we should update the spec if something is missing.

        - fwr1_start
        - fwr1_end
        - cdr1_start
        - cdr1_end
        - fwr2_start
        - fwr2_end
        - cdr2_start
        - cdr2_end
        - fwr3_start
        - fwr3_end
        - cdr3_start
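As a rough illustration of how a consumer might read these fields, here is a minimal Python sketch. The coordinate values are made up, and the helper `region_bounds` is hypothetical; it is not part of IgBlast or the AIRR library.

```python
# Regions that carry both a start and an end field in the list above.
# cdr3 is left out because only cdr3_start is listed.
REGIONS = ["fwr1", "cdr1", "fwr2", "cdr2", "fwr3"]


def region_bounds(delineation):
    """Return {region: (start, end)} for regions with both coordinates set."""
    bounds = {}
    for region in REGIONS:
        start = delineation.get(f"{region}_start")
        end = delineation.get(f"{region}_end")
        if start is not None and end is not None:
            bounds[region] = (start, end)
    return bounds


# Illustrative values only, not taken from any real germline set.
example = {
    "fwr1_start": 1, "fwr1_end": 75,
    "cdr1_start": 76, "cdr1_end": 99,
    "fwr2_start": 100, "fwr2_end": 150,
    "cdr2_start": 151, "cdr2_end": 174,
    "fwr3_start": 175, "fwr3_end": 288,
    "cdr3_start": 289,
}

bounds = region_bounds(example)
for name, (start, end) in bounds.items():
    print(name, start, end)
```

Start and stop positions per region are what IgBlast was said to need, so a flat mapping like this may be close to the shape a converter would have to produce.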

Are the NCBI tools able to read YAML/JSON files? The structure of the data can get a little complicated because there are a number of different objects nested together.

I'll include @williamdlees as he can help answer specifics about how coordinates are specified, in comparison to how IgBlast might need them.

jianye00 commented 2 years ago

IgBlast does not read JSON format, but I saw other tools here use it. Point me to your data and I'll take a look.

schristley commented 2 years ago

OGRDB germline sets. There is an AIRR (JSON) link to download.

schristley commented 2 years ago

There are a number of C++ JSON libraries, e.g., Boost has a JSON parser.

williamdlees commented 2 years ago

There's an explanation of the schema here. For V genes, the important fields are coding_sequence, which is the IMGT-gapped sequence, and v_gene_delineations. For J genes we provide coding_sequence and j_codon_frame.
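To make the field list concrete, here is a minimal Python sketch that walks a germline set document and pulls out those fields. The inline JSON is a tiny made-up stand-in for a downloaded file. The surrounding layout (a top-level `GermlineSet` list containing `allele_descriptions`, plus the `label`, `sequence_type` and `delineation_scheme` keys) and the spelling `v_gene_delineations` are assumptions about the schema, so check them against the actual file before relying on this.

```python
import json

# Made-up document; the structure and all values are assumptions for illustration.
DOC = """
{"GermlineSet": [{"allele_descriptions": [
  {"label": "V-EXAMPLE*01", "sequence_type": "V",
   "coding_sequence": "CAGGTG",
   "v_gene_delineations": [
     {"delineation_scheme": "IMGT", "fwr1_start": 1, "fwr1_end": 6}
   ]},
  {"label": "J-EXAMPLE*01", "sequence_type": "J",
   "coding_sequence": "TGGGGC",
   "j_codon_frame": 1}
]}]}
"""


def summarize(doc_text):
    """Return (label, coding_sequence, annotation) for every allele."""
    data = json.loads(doc_text)
    rows = []
    for germline_set in data.get("GermlineSet", []):
        for allele in germline_set.get("allele_descriptions", []):
            if allele.get("sequence_type") == "V":
                # V genes: list of region delineations, one per scheme.
                annotation = allele.get("v_gene_delineations", [])
            else:
                # J genes: the codon frame.
                annotation = allele.get("j_codon_frame")
            rows.append(
                (allele.get("label"), allele.get("coding_sequence"), annotation)
            )
    return rows


rows = summarize(DOC)
for label, sequence, annotation in rows:
    print(label, sequence, annotation)
```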

scharch commented 1 year ago

@jianye00 was this added in 1.20.0? Release notes are a little cryptic...

jianye00 commented 1 year ago

Yes, it is already supported in 1.20.0. There is a Python script, makeogrannote.py, in the bin folder of the release package. You can use it to produce your custom annotation file from an AIRR JSON file. Then you just feed it to igblast (see "Procedure to use custom FWR/CDR annotation" in https://ncbi.github.io/igblast/cook/How-to-set-up.html for more details).

williamdlees commented 1 year ago

Hi Jian,

There are some potentially breaking changes in the JSON format which you should be aware of, although it looks to me as though the scripts will run ok. The changes are detailed here: https://wordpress.vdjbase.org/index.php/ogrdb_news/germline-set-format-updates/. The key change is that coding_sequence is no longer gapped – although there is an IMGT gapped sequence in the v_gene delineation, if the scheme is IMGT. These changes are live now.

makeogrdb.py uses coding_sequence, but as it strips the dots out I think it should run ok.

Likewise makeogrannote.py makes calls to GetDotCounts which aren’t needed any longer, but they shouldn’t do any harm.
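A small Python sketch of why stripping dots should be harmless after the change: removing gap characters from a sequence that is already ungapped returns it unchanged. `strip_gaps` is a hypothetical stand-in, not the code in makeogrdb.py, and the sequences are made up.

```python
def strip_gaps(sequence):
    """Remove IMGT-style gap characters ('.') from a sequence."""
    return sequence.replace(".", "")


old_style = "CAG...GTG"   # gapped, as coding_sequence used to be
new_style = "CAGGTG"      # ungapped, as coding_sequence is now

# Both inputs reduce to the same ungapped sequence.
same_result = strip_gaps(old_style) == strip_gaps(new_style)

# On ungapped input the call is a no-op.
unchanged = strip_gaps(new_style) == new_style

print(same_result, unchanged)
```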

All the best

William


jianye00 commented 1 year ago

Thanks, William. We'll check to make sure it works.
