NBISweden / IgDiscover-legacy

Analyze antibody repertoires and discover new V genes from high-throughput sequencing reads
https://www.igdiscover.se
MIT License

Where is the complete germline DataBase? #105

Closed CollinJ0 closed 10 months ago

CollinJ0 commented 4 years ago

Hi,

I have run IgDiscover on a dataset of Rhesus Macaque IgM sequences. The final/database/V.fasta file appears to contain only 2 sequences. How can I generate a FASTA file with all of the V genes that this sample has?

Here is stats.json:

{
  "version": "0.11",
  "read_preprocessing": {
    "total": 920964,
    "merged": 920964,
    "merging_was_done": false,
    "raw_reads": 920964,
    "after_primer_trimming": 831440,
    "grouping": {
      "unique_barcodes": 55774,
      "barcode_singletons": 34784,
      "groups_written": 177333,
      "group_size_1": 134235,
      "group_size_2": 12298,
      "group_size_3plus": 30800
    }
  },
  "iterations": [
    {
      "database": {
        "size": 63
      }
    },
    {
      "assignment_filtering": {
        "total": 177333,
        "has_vj_assignment": 176276,
        "has_no_stop": 112355,
        "good_v_evalue": 112302,
        "good_v_coverage": 102943,
        "good_j_coverage": 99292,
        "has_cdr3": 99158
      },
      "database": {
        "size": 2,
        "gained": 1,
        "lost": 62,
        "size_pre": 19,
        "gained_pre": 18,
        "lost_pre": 62
      }
    }
  ],
  "assignment_filtering": {
    "total": 177333,
    "has_vj_assignment": 176056,
    "has_no_stop": 112791,
    "good_v_evalue": 112639,
    "good_v_coverage": 86494,
    "good_j_coverage": 83377,
    "has_cdr3": 83289
  }
}

marcelm commented 4 years ago

The file final/database/V.fasta is the correct one to look at. It should contain all V genes that were found in the sample. For some reason, the discovery process did not work in this case. I am a bit surprised, because the usual reason for this is that there weren’t enough reads, but the read counts look fine in your case.
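To see how many reads survive each filtering step, the final assignment_filtering block of stats.json can be turned into a small table. The following is only a sketch: the function names `load_stats` and `filtering_funnel` are made up for illustration and are not part of IgDiscover, and it assumes stats.json has the layout posted above.

```python
import json


def load_stats(path):
    """Read a stats.json file and return it as a dict."""
    with open(path) as f:
        return json.load(f)


def filtering_funnel(stats):
    """Return (step, count, fraction_of_total) tuples for the
    top-level "assignment_filtering" block, in file order.

    Assumes the block contains a "total" key, as in the stats.json
    shown in this thread.
    """
    steps = stats["assignment_filtering"]
    total = steps["total"]
    return [(name, count, count / total) for name, count in steps.items()]


def print_funnel(stats):
    """Print one line per filtering step with count and percentage."""
    for name, count, frac in filtering_funnel(stats):
        print(f"{name:20s} {count:>8d} {frac:6.1%}")
```

Calling `print_funnel(load_stats("stats.json"))` in the analysis directory would list each step with the share of groups that remain, which makes it easier to judge whether read depth is the likely problem.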

We have released version 0.12 of IgDiscover quite recently. It contains many improvements compared to version 0.11. Can you try that version?

IgDiscover works by first generating a lot of V candidates and then trimming down that list to what it thinks are true germline sequences. Version 0.12 of IgDiscover creates a file iteration-01/annotated_V_germline.tab, which contains all the candidate V sequences, and it will also tell you (in the column why_filtered) why a candidate was filtered out. Looking at that file would be very helpful in your case. See the documentation.
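A quick way to get an overview of that file is to count how often each why_filtered value occurs. This is a minimal sketch, not an IgDiscover feature: `tally_filter_reasons` is a hypothetical helper, it assumes the file is tab-separated with a header row, it treats each why_filtered value as an opaque string, and it assumes that an empty value means the candidate was kept.

```python
import csv
from collections import Counter


def tally_filter_reasons(lines):
    """Count candidates per why_filtered value.

    `lines` is any iterable of text lines from a tab-separated table
    with a header row (for example an open file object).
    Assumption: an empty why_filtered cell means the candidate was
    not filtered out; such rows are counted under "(kept)".
    """
    reader = csv.DictReader(lines, delimiter="\t")
    counts = Counter()
    for row in reader:
        reason = (row.get("why_filtered") or "").strip()
        counts[reason or "(kept)"] += 1
    return counts
```

Usage would look like `with open("iteration-01/annotated_V_germline.tab", newline="") as f: print(tally_filter_reasons(f).most_common())`. If one reason dominates, that points to which filtering criterion removed most candidates.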

Also, it would be important to know what you changed in the igdiscover.yaml file.