Open andrewyatz opened 2 years ago
Thanks, @andrewyatz, for your investigations! I see you base the algorithm on a cutoff (10 bases) to determine whether a gap is to be considered consecutive. Although this produces the same result as the AGP file in your test case, it might, at least in theory, fail in other cases. For such an extension, why not make direct use of the AGP file instead of basing the calculations on the FASTA sequence?
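To illustrate what using the AGP file directly might involve, here is a minimal sketch. It assumes a tab-separated AGP file in which gap rows carry component type `N` or `U` in the fifth column and 1-based inclusive object coordinates in the second and third columns. The function name `agp_gaps` is hypothetical and not taken from any code in this thread.

```python
from typing import Dict, Iterable, List, Tuple


def agp_gaps(lines: Iterable[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Collect gap intervals per object (e.g. chromosome) from AGP lines.

    Assumption: rows are tab-separated and gap rows have component
    type "N" (known size) or "U" (unknown size) in column 5.
    Coordinates are returned as 1-based inclusive (start, end) pairs.
    """
    gaps: Dict[str, List[Tuple[int, int]]] = {}
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a well-formed AGP row; ignore it
        obj, start, end, _part_number, component_type = fields[:5]
        if component_type in ("N", "U"):
            gaps.setdefault(obj, []).append((int(start), int(end)))
    return gaps
```

Called on the lines of an open (decompressed) AGP file, this returns a dictionary of gap intervals keyed by sequence name, which could then be compared against gaps derived from the sequence itself.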
I agree that this should be considered an optional extension. I raised the idea earlier without getting much feedback, if I recall correctly, so I assumed this would be out of scope of the standard. But I do think it really fits well together with other arrays we have been considering, such as `alphabet` and `topology`.
The reason I was investigating this method is that, if we can base another metric on the sequence content rather than on another out-of-band file or data source, we would have an algorithm capable of doing this for any sequence.
As an update, I just redid this. Worryingly, yes, the gaps did start to differ. This is a shame. It seems that either my code is badly broken (very possible) or the edits in each of these sequences are too extensive to recreate the gaps found in the AGP file.
Following on from the previous discussions today, I spent some time writing up a bit of code to find gaps in a FASTA file. It's in Python, so you don't need to worry about Perl. I ran this over chromosome 22 from UCSC and Ensembl. I also verified the results against the [AGP file from UCSC](https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/hg38.agp.gz).
I then ran the gap finder code on two files: one from UCSC, which had to be pulled out using their `twoBitToFa` tool, and the second from Ensembl's FTP site. Both representations of chromosome 22 produced a gap fingerprint identical to the AGP file above.
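For readers who want a feel for the approach, a gap finder of this kind could be sketched as below. This is not the code I actually ran. It is a minimal illustration that assumes a gap is a run of `N` (or `n`) bases and that the 10-base cutoff is a minimum run length. The names `read_fasta`, `find_gaps` and `min_length` are hypothetical.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token as the record name.
            tokens = line[1:].split()
            name, chunks = (tokens[0] if tokens else ""), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_gaps(sequence: str, min_length: int = 10) -> List[Tuple[int, int]]:
    """Return 1-based inclusive (start, end) pairs for each maximal run
    of N/n that is at least min_length bases long.

    Assumption: treating the cutoff as a minimum run length is one
    possible reading; shorter runs are simply ignored here.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_length)
    # m.start() is 0-based and m.end() is exclusive, so start + 1 and
    # end together give 1-based inclusive coordinates.
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence)]
```

Running `find_gaps` over each record from `read_fasta` gives a per-sequence list of gap intervals in the same 1-based inclusive coordinate style that AGP files use, which makes a direct comparison straightforward.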
The biggest issue with this method is that if there are no gaps in the underlying sequence, it cannot be used to define a meaningful fingerprint. As said on the call, the rise of full-length assemblies and improvements in sequencing methods will mean there is limited value here, but there is an argument that says this is still useful.
I would suggest that, if we take on this idea, it should be an optional extension.