airr-community / airr-standards

AIRR Community Data Standards
https://docs.airr-community.org
Creative Commons Attribution 4.0 International

Original fasta header in IgBlast airr output #650

Status: Closed (dudzicp closed 1 year ago)

dudzicp commented 1 year ago

Hello, I am not sure this is the correct place to ask, but I am struggling to get the original FASTA header into the AIRR output of IgBlast.

Given input file:

>valid_1 with some header value
tcaagatgaaaatggttcctttttgttgtatatcgacagacacttggttcatgaagtcacctctccacaagctatggtttacactccatccaagggtccaagaactctttacgataaggtttttgatgcacatgttgtccatcaagatgaaaatgCAGGTGCAGCTGCAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCATCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGAGGCCAAGCGACGAGGCTATTACGATTTTTGGAGTGGTCACAACCCGGGCGTGGGGCCCTCGGGGCCAACTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA

The AIRR output format produces the following, where sequence_id holds only the first word of the header (valid_1):

sequence_id sequence    locus   stop_codon  vj_in_frame v_frameshift    productive  rev_comp    complete_vdj    v_call  d_call  j_call  c_call  sequence_alignment  germline_alignment  sequence_alignment_aa   germline_alignment_aa   v_alignment_start   v_alignment_end d_alignment_start   d_alignment_end j_alignment_start   j_alignment_end c_alignment_start   c_alignment_end v_sequence_alignment    v_sequence_alignment_aa v_germline_alignment    v_germline_alignment_aa d_sequence_alignment    d_sequence_alignment_aa d_germline_alignment    d_germline_alignment_aa j_sequence_alignment    j_sequence_alignment_aa j_germline_alignment    j_germline_alignment_aa c_sequence_alignment    c_sequence_alignment_aa c_germline_alignment    c_germline_alignment_aa fwr1    fwr1_aa cdr1    cdr1_aa fwr2    fwr2_aa cdr2    cdr2_aa fwr3    fwr3_aa fwr4    fwr4_aa cdr3    cdr3_aa junction    junction_length junction_aa junction_aa_length  v_score d_score j_score c_score v_cigar d_cigar j_cigar c_cigar v_support   d_support   j_support   c_support   v_identity  d_identity  j_identity  c_identity  v_sequence_start    v_sequence_end  v_germline_start    v_germline_end  d_sequence_start    d_sequence_end  d_germline_start    d_germline_end  j_sequence_start    j_sequence_end  j_germline_start    j_germline_end  c_sequence_start    c_sequence_end  c_germline_start    c_germline_end  fwr1_start  fwr1_end    cdr1_start  cdr1_end    fwr2_start  fwr2_end    cdr2_start  cdr2_end    fwr3_start  fwr3_end    fwr4_start  fwr4_end    cdr3_start  cdr3_end    np1 np1_length  np2 np2_length
valid_1 TCAAGATGAAAATGGTTCCTTTTTGTTGTATATCGACAGACACTTGGTTCATGAAGTCACCTCTCCACAAGCTATGGTTTACACTCCATCCAAGGGTCCAAGAACTCTTTACGATAAGGTTTTTGATGCACATGTTGTCCATCAAGATGAAAATGCAGGTGCAGCTGCAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCATCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGAGGCCAAGCGACGAGGCTATTACGATTTTTGGAGTGGTCACAACCCGGGCGTGGGGCCCTCGGGGCCAACTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA   IGH F               F   F   gnl|BL_ORD_ID|193   gnl|BL_ORD_ID|17    gnl|BL_ORD_ID|21        CAGGTGCAGCTGCAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCATCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGAGGCCAAGCGACGAGGCTATTACGATTTTTGGAGTGGTCACAACCCGGGCGTGGGGCCCTCGGGGCCAACTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA  CAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGAGGNNNNNNNNNNNNNNAGGTGCAGCTGGTGCAGTCTGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGTCCAGCTGGTGCAATCTGGGGCTGAGGTGAAGAAGCCTGGGTC  QVQLQQWGAGLLKPSETLSLTCAVYGGSFSGYYWSWIRQPPGKGLEWIGEINHSGSINYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCARGQATRLLRFLEWSQPGRGALGANFDYWGQGTLVTVSS  QVQLQQWGAGLLKPSETLSLTCAVYGGSFSGYYWSWIRQPPGKGLEWIGEINHSGSTNYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCARGXXXXXVQLVQSXXXXXXXXXXXPAGAIWG*GEEAWV  1   293 308 328 359 402         
CAGGTGCAGCTGCAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCATCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGAGG   QVQLQQWGAGLLKPSETLSLTCAVYGGSFSGYYWSWIRQPPGKGLEWIGEINHSGSINYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCARG  CAGGTGCAGCTACAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTATGGTGGGTCCTTCAGTGGTTACTACTGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAAATCAATCATAGTGGAAGCACCAACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGTGCGAGAGG   QVQLQQWGAGLLKPSETLSLTCAVYGGSFSGYYWSWIRQPPGKGLEWIGEINHSGSTNYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYCARG  TATTACGATTTTTGGAGTGGT   LRFLEW  AGGTGCAGCTGGTGCAGTCTG   VQLVQS  ACTTTGACTACTGGGGCCAGGGAACCCTGGTCACCGTCTCCTCA    FDYWGQGTLVTVSS  GTCCAGCTGGTGCAATCTGGGGCTGAGGTGAAGAAGCCTGGGTC    PAGAIWG*GEEAWV                  CAGGTGCAGCTGCAGCAGTGGGGCGCAGGACTGTTGAAGCCTTCGGAGACCCTGTCCCTCACCTGCGCTGTCTAT QVQLQQWGAGLLKPSETLSLTCAVY   GGTGGGTCCTTCAGTGGTTACTAC    GGSFSGYY    TGGAGCTGGATCCGCCAGCCCCCAGGGAAGGGGCTGGAGTGGATTGGGGAA WSWIRQPPGKGLEWIGE   ATCAATCATAGTGGAAGCATC   INHSGSI AACTACAACCCGTCCCTCAAGAGTCGAGTCACCATATCAGTAGACACGTCCAAGAACCAGTTCTCCCTGAAGCTGAGCTCTGTGACCGCCGCGGACACGGCTGTGTATTACTGT  NYNPSLKSRVTISVDTSKNQFSLKLSSVTAADTAVYYC                                  452.130 41.064  85.286      155S293M109S    462S1N21M74S274N    513S3N44M247N       4.371e-129  1.085e-07   1.150e-20       99.317  38.095  20.455      156 448 1   293 463 483 2   22  514 557 4   47                  156 230 231 254 255 305 306 326 327 440                 CCAAGCGACGAGGC  14  CACAACCCGGGCGTGGGGCCCTCGGGGCCA  30

The default output format, by contrast, contains the full header in the Query= line:

Database: Human_IGV.txt; Human_IGD.txt; Human_IGJ.txt;
ncbi_human_c_genes.fasta
           473 sequences; 130,550 total letters

Query= valid_1 with some header value

Length=557
                                                                                                      Score     E
Sequences producing significant alignments:                                                          (Bits)  Value

IGHV4-34*01                                                                                           452     4e-129
IGHV1-46*03                                                                                           41.1    1e-07 
IGHV1-69*02                                                                                           85.3    1e-20 
...

Is there a way to pass the full FASTA header contents through to the AIRR output format?

I am using IgBlast 1.18.

Best regards, Paweł

scharch commented 1 year ago

Closing as out of scope: this functionality would be up to the authors of IgBlast.
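
Since the header is not carried through by IgBlast itself, one possible workaround is to re-attach it in a post-processing step. The following is only a minimal sketch, not an IgBlast or AIRR library feature. It assumes that sequence_id in the AIRR output equals the first whitespace-delimited word of the FASTA header (as in the valid_1 example above). The function names and the extra column name fasta_header are hypothetical and chosen for illustration.

```python
import csv
import io


def read_fasta_headers(fasta_lines):
    """Map each sequence ID (first word of the header) to the full header text."""
    headers = {}
    for line in fasta_lines:
        if line.startswith(">"):
            full = line[1:].strip()
            if full:
                # Assumption: the ID is the header text up to the first whitespace.
                headers[full.split()[0]] = full
    return headers


def add_fasta_header(airr_lines, headers, column="fasta_header"):
    """Return the AIRR TSV text with one extra column holding the full header.

    `column` is a hypothetical name for the added field; rows whose
    sequence_id has no matching header get an empty value.
    """
    reader = csv.DictReader(airr_lines, delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=list(reader.fieldnames) + [column],
        delimiter="\t",
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        row[column] = headers.get(row["sequence_id"], "")
        writer.writerow(row)
    return out.getvalue()


# Small in-memory demonstration; real use would pass open file handles instead.
demo_fasta = [">valid_1 with some header value\n", "ACGT\n"]
demo_airr = ["sequence_id\tlocus\n", "valid_1\tIGH\n"]
demo_result = add_fasta_header(demo_airr, read_fasta_headers(demo_fasta))
print(demo_result)
```

With real files, the same two functions can be given the open FASTA file and the open AIRR TSV file, since both only iterate over lines. Keeping the join outside IgBlast means the original output is left untouched and the extra column can be dropped again if a downstream tool does not expect it.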