bio-guoda / preston

a biodiversity dataset tracker
MIT License

support marking of GenBank flat files in content stream #246

Closed jhpoelen closed 1 year ago

jhpoelen commented 1 year ago

GenBank flat files (see https://github.com/epam/NGB/issues/441 and https://www.ncbi.nlm.nih.gov/genbank/samplerecord/) are used to represent GenBank records.

A flat file record begins with a line starting with LOCUS and ends with a line that contains only //.

GenBank publishes gzipped data packages with a bunch of these flat files in them (see https://github.com/globalbioticinteractions/globalbioticinteractions/issues/904).
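The record boundaries described above (a LOCUS line through a // line) can be located with a small scanner. This is a minimal sketch in Python, not preston's implementation; the function name `find_records` is hypothetical, and it assumes records are never nested and that anything before the first LOCUS line is header text to be skipped.

```python
from typing import Iterable, Iterator, Tuple


def find_records(lines: Iterable[str]) -> Iterator[Tuple[int, int]]:
    """Yield (first_line, last_line) pairs, 1-indexed and inclusive, one per record."""
    start = None
    for number, line in enumerate(lines, start=1):
        if start is None and line.startswith("LOCUS"):
            # A record opens at the first LOCUS line seen outside a record.
            start = number
        elif start is not None and line.rstrip("\r\n") == "//":
            # A record closes at a line holding nothing but "//".
            yield start, number
            start = None


# Tiny stand-in for a data package: two header lines, then two short records.
sample = [
    "GenBank release header\n",
    "\n",
    "LOCUS       AB000001\n",
    "ORIGIN\n",
    "//\n",
    "LOCUS       AB000002\n",
    "//\n",
]
print(list(find_records(sample)))  # [(3, 5), (6, 7)]
```

For a gzipped package, the lines could come from `gzip.open(path, "rt")` instead of a list.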

The suggested feature would enable something like:

preston ls\
 | preston genbank-alias-stream 

which would produce some stream of statements like:

<https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=U49845&rettype=gb&retmode=text> <...hasVersion> <line:gz:hash://sha256/abcdef1234...!/some.gz!/L345-L456> 

where

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=U49845&rettype=gb&retmode=text

is a URL for a dynamic NCBI web service query that may retrieve a GenBank flat file by accession id, and line:...!/L345-L456 is the exact location of the associated accession record in some content.
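To make the shape of such a statement concrete, here is a hedged sketch in Python that assembles one from an accession id, a content id, and a line range. The helper names (`efetch_url`, `line_uri`, `alias_statement`) are hypothetical and this is not the proposed command's actual output format. The predicate is left as a parameter because it is elided in the example above.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession: str) -> str:
    """Build the web service query URL for one accession id."""
    query = urlencode(
        {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    )
    return f"{EFETCH}?{query}"


def line_uri(content_id: str, first: int, last: int) -> str:
    """Point at lines first..last (inclusive) of some content."""
    return f"line:{content_id}!/L{first}-L{last}"


def alias_statement(accession: str, predicate: str, content_id: str,
                    first: int, last: int) -> str:
    """Join subject, predicate, and object the way the example statement does."""
    subject = efetch_url(accession)
    obj = line_uri(content_id, first, last)
    return f"<{subject}> <{predicate}> <{obj}>"


# The predicate IRI and content id below are placeholders (assumptions).
print(alias_statement("U49845", "http://purl.org/pav/hasVersion",
                      "gz:hash://sha256/abc", 345, 456))
```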

jhpoelen commented 1 year ago

e.g.,

<https://ftp.ncbi.nlm.nih.gov/genbank/gbpln363.seq.gz> <http://purl.org/pav/hasVersion> <hash://sha256/7e23b7cc1d00f9c9c305e2d88bb7331bcd34fe7c8cee0ac2127bf7e5643512e7> <urn:uuid:7846a965-bd87-461d-9d60-056452867ff1> .

contains a (giant) record associated with accession LR828119 -

preston cat 'line:gz:hash://sha256/510ea17a974eaf35504ca24ad57e9be708196d38189353487eb312426ed9f0b4!/L11-L58'
LOCUS       AB000001                 660 bp    DNA     linear   PLN 15-JUL-2009
DEFINITION  Rhizoctonia solani genes for 18S rRNA, 5.8S rRNA, 28S rRNA, partial
            and complete sequence, isolate: #1.
ACCESSION   AB000001
VERSION     AB000001.1
KEYWORDS    .
SOURCE      Rhizoctonia solani
  ORGANISM  Rhizoctonia solani
            Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina;
            Agaricomycetes; Cantharellales; Ceratobasidiaceae; Rhizoctonia.
REFERENCE   1
  AUTHORS   Kuninaga,S., Natsuaki,T., Takeuchi,T. and Yokosawa,R.
  TITLE     Sequence variation of the rDNA ITS regions within and between
            anastomosis groups in Rhizoctonia solani
  JOURNAL   Curr. Genet. 32 (3), 237-243 (1997)
   PUBMED   9339350
REFERENCE   2  (bases 1 to 660)
  AUTHORS   Kuninaga,S.
  TITLE     Direct Submission
  JOURNAL   Submitted (19-DEC-1996) Contact:Shiro Kuninaga Health Sciences
            University of Hokkaido, General Education; 1757 Kanazawa, Tohbetsu,
            Hokkaido 061-02, Japan
FEATURES             Location/Qualifiers
     source          1..660
                     /organism="Rhizoctonia solani"
                     /mol_type="genomic DNA"
                     /isolate="#1"
                     /db_xref="taxon:456999"
                     /note="group: AG-3"
     rRNA            <1..6
                     /product="18S ribosomal RNA"
     rRNA            229..383
                     /product="5.8S ribosomal RNA"
     rRNA            656..>660
                     /product="28S ribosomal RNA"
ORIGIN      
        1 aattttaatg aagagtttgg ttgtagctgg cccattaatt taggcatgtg cacacctttc
       61 tctttcatcc catacacacc tgtgaacttg tgagacagat ggggaattta tttattgttt
      121 ttttttgtaa tataaagatg ataagtcatt gaacccttct gtctactcaa ctcatataaa
      181 ctcaatttat tttaaaatga atgtaatgga tgtaacgcat ctaatactaa gtttcaacaa
      241 cggatctctt ggctctcgca tcgatgaaga acgcagcgaa atgcgataag taatgtgaat
      301 tgcagaattc agtgaatcat cgaatctttg aacgcacctt gcgctccttg gtattccttg
      361 gagcatgcct gtttgagtat catgaaatct tcaaaatcaa gtcttttgtt aattcaattg
      421 gctttgactt tggtattgga ggtctttgca gcttcacacc tgctcctctt tgtacattag
      481 ctggatctca gtgttatgct tggttccact cagcgtgata agttatctat cgctgaggac
      541 actgtaaaaa ggtggccaag gtaaatgcag atgaaccgct tctaatagtc cattgacttg
      601 gacaatattt ttatgatctg atctcaaatc aggtaggact acccgctgaa cttaagcata
//

jhpoelen commented 1 year ago

with associated accession content retrieved from:

curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=AB000001&rettype=gb&retmode=text"

LOCUS       AB000001                 660 bp    DNA     linear   PLN 15-JUL-2009
DEFINITION  Rhizoctonia solani genes for 18S rRNA, 5.8S rRNA, 28S rRNA, partial
            and complete sequence, isolate: #1.
ACCESSION   AB000001
VERSION     AB000001.1
KEYWORDS    .
SOURCE      Rhizoctonia solani
  ORGANISM  Rhizoctonia solani
            Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina;
            Agaricomycetes; Cantharellales; Ceratobasidiaceae; Rhizoctonia.
REFERENCE   1
  AUTHORS   Kuninaga,S., Natsuaki,T., Takeuchi,T. and Yokosawa,R.
  TITLE     Sequence variation of the rDNA ITS regions within and between
            anastomosis groups in Rhizoctonia solani
  JOURNAL   Curr. Genet. 32 (3), 237-243 (1997)
   PUBMED   9339350
REFERENCE   2  (bases 1 to 660)
  AUTHORS   Kuninaga,S.
  TITLE     Direct Submission
  JOURNAL   Submitted (19-DEC-1996) Contact:Shiro Kuninaga Health Sciences
            University of Hokkaido, General Education; 1757 Kanazawa, Tohbetsu,
            Hokkaido 061-02, Japan
FEATURES             Location/Qualifiers
     source          1..660
                     /organism="Rhizoctonia solani"
                     /mol_type="genomic DNA"
                     /isolate="#1"
                     /db_xref="taxon:456999"
                     /note="group: AG-3"
     rRNA            <1..6
                     /product="18S ribosomal RNA"
     rRNA            229..383
                     /product="5.8S ribosomal RNA"
     rRNA            656..>660
                     /product="28S ribosomal RNA"
ORIGIN      
        1 aattttaatg aagagtttgg ttgtagctgg cccattaatt taggcatgtg cacacctttc
       61 tctttcatcc catacacacc tgtgaacttg tgagacagat ggggaattta tttattgttt
      121 ttttttgtaa tataaagatg ataagtcatt gaacccttct gtctactcaa ctcatataaa
      181 ctcaatttat tttaaaatga atgtaatgga tgtaacgcat ctaatactaa gtttcaacaa
      241 cggatctctt ggctctcgca tcgatgaaga acgcagcgaa atgcgataag taatgtgaat
      301 tgcagaattc agtgaatcat cgaatctttg aacgcacctt gcgctccttg gtattccttg
      361 gagcatgcct gtttgagtat catgaaatct tcaaaatcaa gtcttttgtt aattcaattg
      421 gctttgactt tggtattgga ggtctttgca gcttcacacc tgctcctctt tgtacattag
      481 ctggatctca gtgttatgct tggttccact cagcgtgata agttatctat cgctgaggac
      541 actgtaaaaa ggtggccaag gtaaatgcag atgaaccgct tctaatagtc cattgacttg
      601 gacaatattt ttatgatctg atctcaaatc aggtaggact acccgctgaa cttaagcata
//

jhpoelen commented 1 year ago

Comparing the results from the web service and the associated data package:

$ preston cat 'line:gz:hash://sha256/510ea17a974eaf35504ca24ad57e9be708196d38189353487eb312426ed9f0b4!/L11-L58' > AB000001.gb
$ curl --silent "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=AB000001&rettype=gb&retmode=text" > AB000001.gb.2
$ diff AB000001.gb AB000001.gb.2 
48c48,49
< //
\ No newline at end of file
---
> //
> 

so, it appears that the files only differ by a newline character. This may be a side effect of implementing the line syntax for preston. (fyi @mielliott) .

This is confirmed by the matching sha256 signatures after manually appending the missing newlines (note that echo -e "\n" emits two: the escaped \n plus echo's own trailing newline) -

$ cat\
 <(preston cat 'line:gz:hash://sha256/510ea17a974eaf35504ca24ad57e9be708196d38189353487eb312426ed9f0b4!/L11-L58') \
 <(echo -e "\n") | sha256sum
69c0fe025c8e088e714c20af09e4c68b1c681abfa6610b060006c889008cc601  -
$ cat AB000001.gb.2 | sha256sum
69c0fe025c8e088e714c20af09e4c68b1c681abfa6610b060006c889008cc601  -
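The same check can be sketched in Python. The byte strings below are shortened stand-ins for the two retrievals, not the real records, and `sha256_hex` is a hypothetical helper. The point is only that content hashes are sensitive to trailing newlines and agree once those are restored.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded sha256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


# Stand-in for the line: retrieval, which ends at "//" with no newline.
from_package = b"LOCUS       AB000001\n//"
# Stand-in for the web service response, which ends with "//" plus an empty line.
from_service = b"LOCUS       AB000001\n//\n\n"

print(sha256_hex(from_package) == sha256_hex(from_service))            # False
print(sha256_hex(from_package + b"\n\n") == sha256_hex(from_service))  # True
```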

jhpoelen commented 1 year ago

This means that we've created a citable, offline-enabled version of the

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=AB000001&rettype=gb&retmode=text

functionality.

mielliott commented 1 year ago

so, it appears that the files only differ by a newline character. This may be a side effect of implementing the line syntax for preston. (fyi @mielliott) .

@jhpoelen I just ran some tests for catting content ID'd by hash, alias, and lines. There are various quirks:

$ echo "this is a line" > with-newline.txt
$ echo -n "this is a line" > no-newline.txt
$ preston track file://$(pwd)/no-newline.txt file://$(pwd)/with-newline.txt | grep hasVersion
<file:///home/mielliott/test/no-newline.txt> <http://purl.org/pav/hasVersion> <hash://sha256/ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28> <urn:uuid:a82ae34f-1441-4726-80af-09099de9ec71> .
<file:///home/mielliott/test/with-newline.txt> <http://purl.org/pav/hasVersion> <hash://sha256/9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923> <urn:uuid:9ba2b4d6-834d-42d9-8daa-2501b4e9dec2> .

# Retrieval tests for no-newline.txt / hash://sha256/ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28
## Ask by hash = OK
$ preston cat hash://sha256/ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28 | sha256sum
ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28  -

## Ask by alias = newline added
$ preston cat file:///home/mielliott/test/no-newline.txt | sha256sum
9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923  -

## Ask for line 1 = OK
$ preston cat 'line:hash://sha256/ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28!/L1' | sha256sum
ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28  -

## Ask for lines 1-2 = OK
$ preston cat 'line:hash://sha256/ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28!/L1-L2' | sha256sum
ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28  -

# Retrieval tests for with-newline.txt / hash://sha256/9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923
## Ask by hash = OK
$ preston cat hash://sha256/9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923 | sha256sum
9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923  -

## Ask by alias = OK
$ preston cat file:///home/mielliott/test/with-newline.txt | sha256sum
9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923  -

## Ask for line 1 = newline removed
$ preston cat 'line:hash://sha256/9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923!/L1' | sha256sum
ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28  -

## Ask for lines 1-2 = OK
$ preston cat 'line:hash://sha256/9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923!/L1-L2' | sha256sum
9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923  -

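The quirks: retrieving by alias appends a newline to content that lacks a trailing one, and retrieving a single line (L1) drops that line's trailing newline.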

jhpoelen commented 1 year ago

@mielliott thanks for sharing your notes. Any intuitions on the desired behavior?

mielliott commented 1 year ago

Well, the fact that preston cat [alias] behaves differently from preston cat [hash] is definitely naughty.

For the line: stuff, I suppose that depends on whether \n marks the beginning of the line or the end of one. The head command treats it as the end of a line:

$ echo "haha" | head -n1 | wc -c
5
$ echo -n "haha" | head -n1 | wc -c
4

Not sure if there's an official stance on whether \n is the beginning or end of a line. Maybe Google knows

mielliott commented 1 year ago

My personal preference would be to treat \n as the end of the line (if it's there, print it, otherwise don't add one), so that preston cat 'line:id!/L1' 'line:id!/L2' is the same as preston cat 'line:id!/L1-L2'

Which would behave the same way as using head/tail to pluck out lines 1-2
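The preferred semantics can be sketched in a few lines of Python. This is an illustration of the idea rather than preston's code, and `select_lines` is a hypothetical name. Each line keeps its own trailing \n when one is present and nothing is ever added, so selecting L1 and L2 separately concatenates to the same bytes as selecting L1-L2.

```python
import io
from typing import Optional


def select_lines(content: bytes, first: int, last: Optional[int] = None) -> bytes:
    """Return lines first..last (1-indexed, inclusive) without adding or removing newlines."""
    last = first if last is None else last
    # readlines() splits after each b"\n" and keeps it as part of the line.
    lines = io.BytesIO(content).readlines()
    return b"".join(lines[first - 1:last])


with_newline = b"line one\nline two\n"
no_newline = b"this is a line"

# Selecting lines one at a time matches selecting the range.
print(select_lines(with_newline, 1) + select_lines(with_newline, 2)
      == select_lines(with_newline, 1, 2))         # True
# A final line without a newline comes back unchanged.
print(select_lines(no_newline, 1) == no_newline)   # True
```

Under this rule, a range that runs past the end of the content simply returns what is there, which mirrors how head behaves.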

mielliott commented 1 year ago

https://chat.openai.com/share/060d75d4-b91c-45e2-adec-55b14b8178fa

mielliott commented 1 year ago

So far the chat bots are in favor of \n being the end of a line, not the beginning

https://www.perplexity.ai/search/aeb7961b-5698-465c-b1c4-a0e05c3fff48

The character sequence "\n" represents a newline character, which is used to signify the end of a line of text and the start of a new one[1][2]. It is always used at the end of a line of text to indicate that the next character(s) should be printed on a new line. Therefore, "\n" is the end of a line, not the beginning[3][2].

In programming, the newline character is often used to format text output, and it is usually represented by the escape sequence "\n"[3][2]. In Python, for example, the print() function automatically adds a newline character at the end of its output, but you can also use the "\n" escape sequence to manually insert a newline character[3][2].

It is worth noting that the "^" and "$" symbols are sometimes used to denote the beginning and end of a line, respectively, in regular expressions and some programming languages[4][5]. However, this is a different concept from the newline character represented by "\n".

Citations: [1] https://en.wikipedia.org/wiki/Newline [2] https://www.idtech.com/blog/what-is-n-in-python [3] https://www.freecodecamp.org/news/python-new-line-and-how-to-python-print-without-a-newline/ [4] https://unix.stackexchange.com/questions/510770/when-and-why-did-and-take-on-their-meanings-of-beginning-of-line-and-end [5] https://www.regular-expressions.info/anchors.html

By Perplexity at https://www.perplexity.ai/search/aeb7961b-5698-465c-b1c4-a0e05c3fff48

mielliott commented 1 year ago

Sorry, I meant that the question is about whether "\n is part of the line" vs. "\n is a separator between lines". I don't think anyone's advocating for treating \n as the beginning of a line.

jhpoelen commented 1 year ago

Thanks for the digging and generating texts using general language models (how do you cite these models again?).

Sounds like \n (if present) is considered to be part of the line.

Wanna take a stab at implementing this? Or are you still busy writing your proposal?

mielliott commented 1 year ago

I'd cite the conversation with ChatGPT as a "personal correspondence".

Sure, I can take a look at it, I'll holler if something comes up though

Just to make sure we're on the same page @jhpoelen - preston's current behavior with line: is to remove the trailing newline, and this is causing records plucked from ncbi's web service outputs to have a different hash than their GenBank-packaged flat file counterparts? So having preston line: retrievals include the trailing \n should fix this issue, and it's a win-win because we'd prefer that preston doesn't strip off the \n anyway.

Note https://github.com/bio-guoda/preston/issues/128#issuecomment-1110034753 might explain any deja vu

jhpoelen commented 1 year ago

Yes, line: retrievals including \n sounds like a good ol' win-win. Thanks for articulating the desired / current behavior.

jhpoelen commented 1 year ago

With current additions, the following genbank "flat file" -

LOCUS       KT156259                 329 bp    DNA     linear   PLN 31-MAY-2018
DEFINITION  [Chrysosporium] lobatum strain CBS 624.79 elongation factor 3 gene,
            partial cds.
ACCESSION   KT156259
VERSION     KT156259.1
KEYWORDS    .
SOURCE      [Chrysosporium] lobatum
  ORGANISM  [Chrysosporium] lobatum
            Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina;
            Eurotiomycetes; Eurotiomycetidae; Onygenales; Onygenaceae;
            Chrysosporium.
REFERENCE   1  (bases 1 to 329)
  AUTHORS   Stielow,J., Dukik,K., Goeker,M. and deHoog,G.
  TITLE     Phylogenetic revision of the order Onygenales
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 329)
  AUTHORS   Stielow,J., Dukik,K., Goeker,M. and deHoog,G.
  TITLE     Direct Submission
  JOURNAL   Submitted (03-APR-2015) Medical Mycology and Extremophile Fungi,
            CBS-KNAW Fungal Biodiversity Centre, Uppsalalaan 8, Utrecht,
            Utrecht 3584 CT, The Netherlands
COMMENT     ##Assembly-Data-START##
            Assembly Method       :: Biolomics v. 7
            Coverage              :: 1X
            Sequencing Technology :: Sanger dideoxy sequencing
            ##Assembly-Data-END##
FEATURES             Location/Qualifiers
     source          1..329
                     /organism="[Chrysosporium] lobatum"
                     /mol_type="genomic DNA"
                     /strain="CBS 624.79"
                     /isolation_source="skin crust; Gallus gallus"
                     /culture_collection="CBS:624.79"
                     /db_xref="taxon:85844"
                     /country="Romania"
                     /collected_by="I. Alteras"
     mRNA            <1..>329
                     /product="elongation factor 3"
     CDS             <1..>329
                     /codon_start=1
                     /product="elongation factor 3"
                     /protein_id="AMQ77042.1"
                     /translation="KMKLALCRAVFEKPDILLLDEPTNHMDVKNVAWLEQYLINSPCT
                     SIIVSHDSKFLNNVIQHVIHYERFKLRRYRGNLTEFARRLPSARSYFELGASELEFKF
                     PEPGFLDG"
ORIGIN      
        1 aagatgaagc tcgctctctg ccgtgctgtg tttgagaagc ccgatatctt gcttcttgac
       61 gagcccacca accacatgga cgtgaagaac gtcgcctggt tggagcagta tcttatcaac
      121 tctccttgca cttccatcat cgtctcccac gacagcaagt tcttgaacaa cgtcatccag
      181 cacgttattc attacgagcg cttcaagctc cgccgttacc gcggtaactt gaccgagttc
      241 gccagacgtc tcccatccgc tcgctcgtac tttgaactcg gtgcctctga gctcgagttc
      301 aagttccctg agcccggttt cctcgacgg
//

is now exposed as -

{
  "accession": "KT156259",
  "definition": "[Chrysosporium] lobatum strain CBS 624.79 elongation factor 3 gene, partial cds.",
  "organism": "[Chrysosporium] lobatum",
  "db_xref": "taxon:85844",
  "country": "Romania",
  "http://www.w3.org/ns/prov#wasDerivedFrom": "line:gz:hash://sha256/8efca32f6aa1837303c1d8ea409eef8f0837ca743bddd02001bd1819d4504ed0!/L11-L63",
  "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "genbank-flatfile"
}
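For readers curious how a flat file record might map onto these fields, here is a rough Python sketch. It is not the gb-stream implementation; `summarize_record` is a hypothetical name and `SAMPLE` is a trimmed copy of the record above. It assumes the keyword sits in the first 12 columns, that only DEFINITION needs its continuation lines joined, and that the first db_xref and country qualifiers are the ones wanted.

```python
import json
import re

# Matches feature qualifiers such as:   /db_xref="taxon:85844"
QUALIFIER = re.compile(r'^\s+/(db_xref|country)="([^"]*)"')


def summarize_record(record: str, provenance: str) -> dict:
    """Pull a few fields out of one flat file record (illustrative only)."""
    summary = {}
    current = None
    for line in record.splitlines():
        qualifier = QUALIFIER.match(line)
        if qualifier:
            # Keep the first value seen for each qualifier.
            summary.setdefault(qualifier.group(1), qualifier.group(2))
            current = None
            continue
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword in ("ACCESSION", "DEFINITION", "ORGANISM"):
            current = keyword.lower()
            summary[current] = value
        elif not keyword and current == "definition":
            # Continuation line of a multi-line DEFINITION.
            summary[current] += " " + value
        else:
            current = None
    summary["http://www.w3.org/ns/prov#wasDerivedFrom"] = provenance
    summary["http://www.w3.org/1999/02/22-rdf-syntax-ns#type"] = "genbank-flatfile"
    return summary


SAMPLE = "\n".join([
    "LOCUS       KT156259                 329 bp    DNA     linear   PLN 31-MAY-2018",
    "DEFINITION  [Chrysosporium] lobatum strain CBS 624.79 elongation factor 3 gene,",
    "            partial cds.",
    "ACCESSION   KT156259",
    "SOURCE      [Chrysosporium] lobatum",
    "  ORGANISM  [Chrysosporium] lobatum",
    "            Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina;",
    "FEATURES             Location/Qualifiers",
    "     source          1..329",
    '                     /organism="[Chrysosporium] lobatum"',
    '                     /db_xref="taxon:85844"',
    '                     /country="Romania"',
    "//",
]) + "\n"

# The provenance id is a placeholder, not a real content hash.
print(json.dumps(summarize_record(SAMPLE, "line:gz:hash://sha256/abc!/L11-L63"), indent=2))
```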

jhpoelen commented 1 year ago

First version of gb-stream included in v0.6.4

jhpoelen commented 11 months ago

Also see https://github.com/jhpoelen/obi-genbank .