ga4gh / refget

GA4GH Refget specifications docs
https://ga4gh.github.io/refget
14 stars 7 forks source link

Fingerprinting sequences using gaps #29

Open andrewyatz opened 2 years ago

andrewyatz commented 2 years ago

Following on from the previous discussions today I spent some time writing up a bit of code to find gaps in a FASTA file. It's in Python so you don't need to worry about Perl. I ran this over chromosome 22 from UCSC and Ensembl. I also verified this against the [AGP file from UCSC](wget https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.agp.gz).

$ curl -s https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.agp.gz  | gzip -dc | grep --regexp="\t[NU]\t" | grep "chr22\t"
chr22   1   10000   1   N   10000   telomere    no  na
chr22   10001   10510000    2   N   10500000    short_arm   no  na
chr22   10784644    10834643    5   N   50000   contig  no  na
chr22   10874573    10924572    7   N   50000   contig  no  na
chr22   10966725    11016724    9   N   50000   contig  no  na
chr22   11068988    11118987    12  N   50000   contig  no  na
chr22   11160922    11210921    14  N   50000   contig  no  na
chr22   11378057    11428056    16  N   50000   contig  no  na
chr22   11497338    11547337    19  N   50000   contig  no  na
chr22   11631289    11681288    23  N   50000   contig  no  na
chr22   11724630    11774629    25  N   50000   contig  no  na
chr22   11977556    12027555    27  N   50000   contig  no  na
chr22   12225589    12275588    29  N   50000   contig  no  na
chr22   12438691    12488690    31  N   50000   contig  no  na
chr22   12641731    12691730    33  N   50000   contig  no  na
chr22   12726205    12776204    35  N   50000   contig  no  na
chr22   12818138    12868137    37  N   50000   contig  no  na
chr22   12904789    12954788    39  N   50000   contig  no  na
chr22   12977326    12977425    41  N   100 contig  no  na
chr22   12986172    12994027    43  N   7856    scaffold    yes paired-ends
chr22   13011654    13014130    45  N   2477    scaffold    yes paired-ends
chr22   13021323    13021422    47  N   100 contig  no  na
chr22   13109445    13109544    49  N   100 contig  no  na
chr22   13163678    13163777    51  N   100 contig  no  na
chr22   13227313    13227412    53  N   100 contig  no  na
chr22   13248083    13248182    55  N   100 contig  no  na
chr22   13254853    13254952    57  N   100 contig  no  na
chr22   13258198    13258297    59  N   100 contig  no  na
chr22   13280859    13280958    61  N   100 contig  no  na
chr22   13285144    13285243    63  N   100 contig  no  na
chr22   14419455    14419554    65  N   100 contig  no  na
chr22   14419895    14419994    67  N   100 contig  no  na
chr22   14420335    14420434    69  N   100 contig  no  na
chr22   14421633    14421732    71  N   100 contig  no  na
chr22   15054319    15154318    73  N   100000  contig  no  na
chr22   16279673    16302843    99  N   23171   scaffold    yes unspecified
chr22   16304297    16305427    101 N   1131    scaffold    yes paired-ends
chr22   16307049    16307605    103 N   557 scaffold    yes paired-ends
chr22   16310303    16310402    105 N   100 scaffold    yes paired-ends
chr22   16313517    16314010    107 N   494 scaffold    yes paired-ends
chr22   18239130    18339129    142 N   100000  contig  no  na
chr22   18433514    18483513    144 N   50000   contig  no  na
chr22   18659565    18709564    146 N   50000   contig  no  na
chr22   49973866    49975365    1119    N   1500    contig  no  na
chr22   50808469    50818468    1153    N   10000   telomere    no  na

I then ran the gap finder code on two files. One from UCSC which had to be pulled out using their twoBitToFa tool and the second from Ensembl's FTP site.

Both representations of chromosome 22 produced a gap fingerprint identical to the AGP file above.

The biggest issue with this method is if there are no gaps in the underlying sequence, then this cannot be used to define a meaningful fingerprint. As said on the call the rise of full length assemblies and improvements in sequencing methods will mean there is limited value here but there is an argument that says this is useful.

I would suggest that if we take on this idea that it's an optional extension

sveinugu commented 2 years ago

Thanks, @andrewyatz, for your investigations! I see you base the algorithm on a cutoff (10 bases) to determine whether a gap is to be considered consecutive. Although this produces the same result as the AGP file in your test case, at least theoretically it might fail in other cases. For such an extension, why couldn't one just make directly use of the AGP file instead of basing the calculations on the FASTA sequence?

I agree that this should be considered an optional extension. I raised the idea earlier without getting much feedback, if I recall correctly, so I assumed this would be out of scope of the standard. But I do think it really fits well together with other arrays we have been considering, such as alphabet and topology.

andrewyatz commented 2 years ago

The reason I was trying to investigate this method was because if we can base another metric on the sequence content rather than another out of band file or data source there would be an algorithm capable of doing this for any sequence.

As an update I just redid this. Worryingly yes the gaps did start to differ. This is a shame. It seems though either my code is broken badly (very possible) or the edits in each of these sequences is too much to recreate the gaps found in the AGP file