Closed jhpoelen closed 1 year ago
e.g.,
<https://ftp.ncbi.nlm.nih.gov/genbank/gbpln363.seq.gz> <http://purl.org/pav/hasVersion> <hash://sha256/7e23b7cc1d00f9c9c305e2d88bb7331bcd34fe7c8cee0ac2127bf7e5643512e7> <urn:uuid:7846a965-bd87-461d-9d60-056452867ff1> .
contains a (giant) record associated with accession LR828119 -
preston cat 'line:gz:hash://sha256/510ea17a974eaf35504ca24ad57e9be708196d38189353487eb312426ed9f0b4!/L11-L58'
LOCUS AB000001 660 bp DNA linear PLN 15-JUL-2009
DEFINITION Rhizoctonia solani genes for 18S rRNA, 5.8S rRNA, 28S rRNA, partial
and complete sequence, isolate: #1.
ACCESSION AB000001
VERSION AB000001.1
KEYWORDS .
SOURCE Rhizoctonia solani
ORGANISM Rhizoctonia solani
Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina;
Agaricomycetes; Cantharellales; Ceratobasidiaceae; Rhizoctonia.
REFERENCE 1
AUTHORS Kuninaga,S., Natsuaki,T., Takeuchi,T. and Yokosawa,R.
TITLE Sequence variation of the rDNA ITS regions within and between
anastomosis groups in Rhizoctonia solani
JOURNAL Curr. Genet. 32 (3), 237-243 (1997)
PUBMED 9339350
REFERENCE 2 (bases 1 to 660)
AUTHORS Kuninaga,S.
TITLE Direct Submission
JOURNAL Submitted (19-DEC-1996) Contact:Shiro Kuninaga Health Sciences
University of Hokkaido, General Education; 1757 Kanazawa, Tohbetsu,
Hokkaido 061-02, Japan
FEATURES Location/Qualifiers
source 1..660
/organism="Rhizoctonia solani"
/mol_type="genomic DNA"
/isolate="#1"
/db_xref="taxon:456999"
/note="group: AG-3"
rRNA <1..6
/product="18S ribosomal RNA"
rRNA 229..383
/product="5.8S ribosomal RNA"
rRNA 656..>660
/product="28S ribosomal RNA"
ORIGIN
1 aattttaatg aagagtttgg ttgtagctgg cccattaatt taggcatgtg cacacctttc
61 tctttcatcc catacacacc tgtgaacttg tgagacagat ggggaattta tttattgttt
121 ttttttgtaa tataaagatg ataagtcatt gaacccttct gtctactcaa ctcatataaa
181 ctcaatttat tttaaaatga atgtaatgga tgtaacgcat ctaatactaa gtttcaacaa
241 cggatctctt ggctctcgca tcgatgaaga acgcagcgaa atgcgataag taatgtgaat
301 tgcagaattc agtgaatcat cgaatctttg aacgcacctt gcgctccttg gtattccttg
361 gagcatgcct gtttgagtat catgaaatct tcaaaatcaa gtcttttgtt aattcaattg
421 gctttgactt tggtattgga ggtctttgca gcttcacacc tgctcctctt tgtacattag
481 ctggatctca gtgttatgct tggttccact cagcgtgata agttatctat cgctgaggac
541 actgtaaaaa ggtggccaag gtaaatgcag atgaaccgct tctaatagtc cattgacttg
601 gacaatattt ttatgatctg atctcaaatc aggtaggact acccgctgaa cttaagcata
//
with associated accession content retrieved from:
LOCUS AB000001 660 bp DNA linear PLN 15-JUL-2009
DEFINITION Rhizoctonia solani genes for 18S rRNA, 5.8S rRNA, 28S rRNA, partial
and complete sequence, isolate: #1.
ACCESSION AB000001
VERSION AB000001.1
KEYWORDS .
SOURCE Rhizoctonia solani
ORGANISM Rhizoctonia solani
Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina;
Agaricomycetes; Cantharellales; Ceratobasidiaceae; Rhizoctonia.
REFERENCE 1
AUTHORS Kuninaga,S., Natsuaki,T., Takeuchi,T. and Yokosawa,R.
TITLE Sequence variation of the rDNA ITS regions within and between
anastomosis groups in Rhizoctonia solani
JOURNAL Curr. Genet. 32 (3), 237-243 (1997)
PUBMED 9339350
REFERENCE 2 (bases 1 to 660)
AUTHORS Kuninaga,S.
TITLE Direct Submission
JOURNAL Submitted (19-DEC-1996) Contact:Shiro Kuninaga Health Sciences
University of Hokkaido, General Education; 1757 Kanazawa, Tohbetsu,
Hokkaido 061-02, Japan
FEATURES Location/Qualifiers
source 1..660
/organism="Rhizoctonia solani"
/mol_type="genomic DNA"
/isolate="#1"
/db_xref="taxon:456999"
/note="group: AG-3"
rRNA <1..6
/product="18S ribosomal RNA"
rRNA 229..383
/product="5.8S ribosomal RNA"
rRNA 656..>660
/product="28S ribosomal RNA"
ORIGIN
1 aattttaatg aagagtttgg ttgtagctgg cccattaatt taggcatgtg cacacctttc
61 tctttcatcc catacacacc tgtgaacttg tgagacagat ggggaattta tttattgttt
121 ttttttgtaa tataaagatg ataagtcatt gaacccttct gtctactcaa ctcatataaa
181 ctcaatttat tttaaaatga atgtaatgga tgtaacgcat ctaatactaa gtttcaacaa
241 cggatctctt ggctctcgca tcgatgaaga acgcagcgaa atgcgataag taatgtgaat
301 tgcagaattc agtgaatcat cgaatctttg aacgcacctt gcgctccttg gtattccttg
361 gagcatgcct gtttgagtat catgaaatct tcaaaatcaa gtcttttgtt aattcaattg
421 gctttgactt tggtattgga ggtctttgca gcttcacacc tgctcctctt tgtacattag
481 ctggatctca gtgttatgct tggttccact cagcgtgata agttatctat cgctgaggac
541 actgtaaaaa ggtggccaag gtaaatgcag atgaaccgct tctaatagtc cattgacttg
601 gacaatattt ttatgatctg atctcaaatc aggtaggact acccgctgaa cttaagcata
//
in comparing the results from the webservice and associated data package,
$ preston cat 'line:gz:hash://sha256/510ea17a974eaf35504ca24ad57e9be708196d38189353487eb312426ed9f0b4!/L11-L58' > AB000001.gb
$ curl --silent "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=AB000001&rettype=gb&retmode=text" > AB000001.gb.2
$ diff AB000001.gb AB000001.gb.2
48c48,49
< //
\ No newline at end of file
---
> //
>
so, it appears that the files only differ by a newline character. This may be a side effect of implementing the line syntax for preston. (fyi @mielliott)
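This is confirmed by the matching sha256 signatures after manually appending the missing \n characters (note that echo -e "\n" emits two newlines, matching the extra empty line in the diff) -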
$ cat \
<(preston cat 'line:gz:hash://sha256/510ea17a974eaf35504ca24ad57e9be708196d38189353487eb312426ed9f0b4!/L11-L58') \
<(echo -e "\n") | sha256sum
69c0fe025c8e088e714c20af09e4c68b1c681abfa6610b060006c889008cc601 -
$ cat AB000001.gb.2 | sha256sum
69c0fe025c8e088e714c20af09e4c68b1c681abfa6610b060006c889008cc601 -
This means that we've created a citable, offline-enabled version of the
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=AB000001&rettype=gb&retmode=text
functionality.
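A quick way to check that two files differ only in trailing newlines, without patching one of them by hand, is to lean on the shell's command substitution, which strips all trailing newlines. This is a minimal sketch using throwaway files; the helper name same_modulo_trailing_newlines is made up for illustration and is not part of preston:

```shell
# Hypothetical helper (not part of preston): succeeds when two text files are
# identical apart from trailing newline characters.
# Relies on $(...) stripping all trailing newlines from the captured output.
same_modulo_trailing_newlines() {
  [ "$(cat "$1")" = "$(cat "$2")" ]
}

# Demo with two throwaway files that mimic the diff above:
# one ends in "//", the other in "//" followed by two newlines.
tmpdir=$(mktemp -d)
printf 'ORIGIN\n//' > "$tmpdir/a.gb"
printf 'ORIGIN\n//\n\n' > "$tmpdir/b.gb"

if same_modulo_trailing_newlines "$tmpdir/a.gb" "$tmpdir/b.gb"; then
  echo "same content apart from trailing newlines"
fi
```

With the files from the transcript above, the call would presumably be same_modulo_trailing_newlines AB000001.gb AB000001.gb.2.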
> so, it appears that the files only differ by a newline character. This may be a side effect of implementing the line syntax for preston. (fyi @mielliott) .
@jhpoelen I just ran some tests for catting content ID'd by hash, alias, and lines. There are various quirks:
$ echo "this is a line" > with-newline.txt
$ echo -n "this is a line" > no-newline.txt
$ preston track file://$(pwd)/no-newline.txt file://$(pwd)/with-newline.txt | grep hasVersion
<file:///home/mielliott/test/no-newline.txt> <http://purl.org/pav/hasVersion> <hash://sha256/ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28> <urn:uuid:a82ae34f-1441-4726-80af-09099de9ec71> .
<file:///home/mielliott/test/with-newline.txt> <http://purl.org/pav/hasVersion> <hash://sha256/9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923> <urn:uuid:9ba2b4d6-834d-42d9-8daa-2501b4e9dec2> .
# Retrieval tests for no-newline.txt / hash://sha256/ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28
## Ask by hash = OK
$ preston cat hash://sha256/ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28 | sha256sum
ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28 -
## Ask by alias = newline added
$ preston cat file:///home/mielliott/test/no-newline.txt | sha256sum
9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923 -
## Ask for line 1 = OK
$ preston cat 'line:hash://sha256/ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28!/L1' | sha256sum
ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28 -
## Ask for lines 1-2 = OK
$ preston cat 'line:hash://sha256/ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28!/L1-L2' | sha256sum
ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28 -
# Retrieval tests for with-newline.txt / hash://sha256/9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923
## Ask by hash = OK
$ preston cat hash://sha256/9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923 | sha256sum
9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923 -
## Ask by alias = OK
$ preston cat file:///home/mielliott/test/with-newline.txt | sha256sum
9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923 -
## Ask for line 1 = newline removed
$ preston cat 'line:hash://sha256/9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923!/L1' | sha256sum
ce529d4c495145caed2ff70aad73daa6dead03b0f34c051a3cc39d8b9e258b28 -
## Ask for lines 1-2 = OK
$ preston cat 'line:hash://sha256/9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923!/L1-L2' | sha256sum
9e9ede53c8b612ea8518905d7a20eb1d292af585d1f7e95e499a3af76161e923 -
The quirks:
- preston cat [alias] likes to print a newline at the end
- preston cat 'line:[id]!/L1' does not print a newline at the end
- preston cat 'line:[id]!/L1-L2' - requesting more lines than are available - does not add a new line (this is good)

@mielliott thanks for sharing your notes. Any intuitions on the desired behavior?
Well, the fact that preston cat [alias] behaves differently from preston cat [hash] is definitely naughty.
For the line: stuff, I suppose that depends on whether \n marks the beginning of the line or the end of one. The head command treats it as the end of a line:
$ echo "haha" | head -n1 | wc -c
5
$ echo -n "haha" | head -n1 | wc -c
4
Not sure if there's an official stance on whether \n is the beginning or end of a line. Maybe Google knows
My personal preference would be to treat \n as the end of the line (if it's there, print it, otherwise don't add one), so that
preston cat 'line:id!/L1' 'line:id!/L2'
is the same as
preston cat 'line:id!/L1-L2'
which would behave the same way as using head/tail to pluck out lines 1-2.
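The head/tail analogy can be checked with standard tools alone. The sketch below involves no preston at all; it only shows that when \n is kept as the end of each line, plucking lines one at a time and concatenating them gives the same bytes as plucking the range in one go (the file names are placeholders):

```shell
# Sketch with standard tools only (no preston involved).
tmpdir=$(mktemp -d)
printf 'first\nsecond\nthird\n' > "$tmpdir/sample.txt"

# Line 1, then line 2, each with its trailing newline intact.
{
  head -n 1 "$tmpdir/sample.txt"
  head -n 2 "$tmpdir/sample.txt" | tail -n 1
} > "$tmpdir/one-by-one.txt"

# Lines 1-2 in a single request.
head -n 2 "$tmpdir/sample.txt" > "$tmpdir/range.txt"

# cmp exits 0 when both files are byte-for-byte identical.
cmp "$tmpdir/one-by-one.txt" "$tmpdir/range.txt" && echo "identical"
```

If each plucked line had its \n stripped instead, the one-by-one output would collapse into a single line and the comparison would fail.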
So far the chat bots are in favor of \n being the end of a line, not the beginning
https://www.perplexity.ai/search/aeb7961b-5698-465c-b1c4-a0e05c3fff48
The character sequence "\n" represents a newline character, which is used to signify the end of a line of text and the start of a new one[1][2]. It is always used at the end of a line of text to indicate that the next character(s) should be printed on a new line. Therefore, "\n" is the end of a line, not the beginning[3][2].
In programming, the newline character is often used to format text output, and it is usually represented by the escape sequence "\n"[3][2]. In Python, for example, the print() function automatically adds a newline character at the end of its output, but you can also use the "\n" escape sequence to manually insert a newline character[3][2].
It is worth noting that the "^" and "$" symbols are sometimes used to denote the beginning and end of a line, respectively, in regular expressions and some programming languages[4][5]. However, this is a different concept from the newline character represented by "\n".
Citations: [1] https://en.wikipedia.org/wiki/Newline [2] https://www.idtech.com/blog/what-is-n-in-python [3] https://www.freecodecamp.org/news/python-new-line-and-how-to-python-print-without-a-newline/ [4] https://unix.stackexchange.com/questions/510770/when-and-why-did-and-take-on-their-meanings-of-beginning-of-line-and-end [5] https://www.regular-expressions.info/anchors.html
By Perplexity at https://www.perplexity.ai/search/aeb7961b-5698-465c-b1c4-a0e05c3fff48
Sorry, I meant that the question is about whether "\n is part of the line" vs. "\n is a separator between lines". I don't think anyone's advocating for treating \n as the beginning of a line.
Thanks for the digging and generating texts using general language models (how do you cite these models again?).
Sounds like \n (if present) is considered to be part of the line.
Wanna take a stab at implementing this? Or are you still busy writing your proposal?
I'd cite the conversation with ChatGPT as a "personal correspondence".
Sure, I can take a look at it, I'll holler if something comes up though
Just to make sure we're on the same page @jhpoelen - preston's current behavior with line: is to remove the trailing newline, and this is causing records plucked from ncbi's web service outputs to have a different hash than their GenBank-packaged flat file counterparts? So having preston line: retrievals include the trailing \n should fix this issue, and it's a win-win because we'd prefer that preston doesn't strip off the \n anyway.
Note https://github.com/bio-guoda/preston/issues/128#issuecomment-1110034753 might explain any deja vu
Yes, line: retrievals including \n sounds like a good ol' win-win. Thanks for articulating the desired / current behavior.
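For reference, the agreed behavior (keep \n when it is there, never add one) can be mimicked with tail and head. This is only a stand-in to make the expectation concrete, not preston's implementation; pluck_lines and sha256_of are hypothetical helper names, and sha256sum is assumed to be the GNU coreutils tool used elsewhere in this thread:

```shell
# Hypothetical stand-in for the desired "line:" behavior, built from tail/head.
# Prints lines $2 through $3 of file $1, keeping each line's "\n" if present
# and never adding one that was not in the original content.
pluck_lines() {
  tail -n "+$2" "$1" | head -n "$(( $3 - $2 + 1 ))"
}

# Print only the hex digest of whatever arrives on stdin.
sha256_of() {
  sha256sum | cut -d' ' -f1
}

tmpdir=$(mktemp -d)
printf 'this is a line\n' > "$tmpdir/with-newline.txt"
printf 'this is a line'   > "$tmpdir/no-newline.txt"

# For a one-line file, plucking line 1 should reproduce the content hash,
# whether or not the file ends in a newline.
for f in with-newline.txt no-newline.txt; do
  whole=$(sha256_of < "$tmpdir/$f")
  line1=$(pluck_lines "$tmpdir/$f" 1 1 | sha256_of)
  if [ "$whole" = "$line1" ]; then
    echo "$f: line 1 hash matches content hash"
  fi
done
```

Asking for more lines than exist (for example pluck_lines file 1 2 on a one-line file) simply returns what is there, which matches the "this is good" quirk noted earlier.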
With current additions, the following genbank "flat file" -
LOCUS KT156259 329 bp DNA linear PLN 31-MAY-2018
DEFINITION [Chrysosporium] lobatum strain CBS 624.79 elongation factor 3 gene,
partial cds.
ACCESSION KT156259
VERSION KT156259.1
KEYWORDS .
SOURCE [Chrysosporium] lobatum
ORGANISM [Chrysosporium] lobatum
Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina;
Eurotiomycetes; Eurotiomycetidae; Onygenales; Onygenaceae;
Chrysosporium.
REFERENCE 1 (bases 1 to 329)
AUTHORS Stielow,J., Dukik,K., Goeker,M. and deHoog,G.
TITLE Phylogenetic revision of the order Onygenales
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 329)
AUTHORS Stielow,J., Dukik,K., Goeker,M. and deHoog,G.
TITLE Direct Submission
JOURNAL Submitted (03-APR-2015) Medical Mycology and Extremophile Fungi,
CBS-KNAW Fungal Biodiversity Centre, Uppsalalaan 8, Utrecht,
Utrecht 3584 CT, The Netherlands
COMMENT ##Assembly-Data-START##
Assembly Method :: Biolomics v. 7
Coverage :: 1X
Sequencing Technology :: Sanger dideoxy sequencing
##Assembly-Data-END##
FEATURES Location/Qualifiers
source 1..329
/organism="[Chrysosporium] lobatum"
/mol_type="genomic DNA"
/strain="CBS 624.79"
/isolation_source="skin crust; Gallus gallus"
/culture_collection="CBS:624.79"
/db_xref="taxon:85844"
/country="Romania"
/collected_by="I. Alteras"
mRNA <1..>329
/product="elongation factor 3"
CDS <1..>329
/codon_start=1
/product="elongation factor 3"
/protein_id="AMQ77042.1"
/translation="KMKLALCRAVFEKPDILLLDEPTNHMDVKNVAWLEQYLINSPCT
SIIVSHDSKFLNNVIQHVIHYERFKLRRYRGNLTEFARRLPSARSYFELGASELEFKF
PEPGFLDG"
ORIGIN
1 aagatgaagc tcgctctctg ccgtgctgtg tttgagaagc ccgatatctt gcttcttgac
61 gagcccacca accacatgga cgtgaagaac gtcgcctggt tggagcagta tcttatcaac
121 tctccttgca cttccatcat cgtctcccac gacagcaagt tcttgaacaa cgtcatccag
181 cacgttattc attacgagcg cttcaagctc cgccgttacc gcggtaactt gaccgagttc
241 gccagacgtc tcccatccgc tcgctcgtac tttgaactcg gtgcctctga gctcgagttc
301 aagttccctg agcccggttt cctcgacgg
//
is now exposed as -
{
"accession": "KT156259",
"definition": "[Chrysosporium] lobatum strain CBS 624.79 elongation factor 3 gene, partial cds.",
"organism": "[Chrysosporium] lobatum",
"db_xref": "taxon:85844",
"country": "Romania",
"http://www.w3.org/ns/prov#wasDerivedFrom": "line:gz:hash://sha256/8efca32f6aa1837303c1d8ea409eef8f0837ca743bddd02001bd1819d4504ed0!/L11-L63",
"http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "genbank-flatfile"
}
First version of gb-stream included in v0.6.4
Also see https://github.com/jhpoelen/obi-genbank .
GenBank flat files (see https://github.com/epam/NGB/issues/441 and https://www.ncbi.nlm.nih.gov/genbank/samplerecord/) are used to represent GenBank records.
A flat file begins with a line starting with LOCUS and ends with a line that only has // on it.
GenBank publishes gzipped data packages with a bunch of these flat files in them (see https://github.com/globalbioticinteractions/globalbioticinteractions/issues/904).
Suggested feature would help do something like:
which would produce some stream of statements like:
where
is a url to a dynamic ncbi web service query that may retrieve a GenBank flat file by accession id, and line:...!/L345-L456 is the exact location of an associated accession record in some content.
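To make the idea of locating records by line range concrete, here is a rough awk-based sketch. It is not preston's gb-stream and makes no claim about how that is implemented; locate_records and the CONTENT_ID placeholder are invented for illustration, and the input is a toy stream rather than real GenBank data:

```shell
# Hypothetical sketch (not preston's gb-stream): for each GenBank record on
# stdin, print "<accession> line:<content id>!/L<first>-L<last>".
# A record starts at a line beginning with LOCUS and ends at a line that is
# exactly "//". The content id ($1) is passed through untouched.
locate_records() {
  awk -v content="$1" '
    /^LOCUS/     { start = NR; accession = "" }
    /^ACCESSION/ { if (accession == "") accession = $2 }
    /^\/\/$/     {
      if (start) printf "%s line:%s!/L%d-L%d\n", accession, content, start, NR
      start = 0
    }
  '
}

# Toy input: one header line followed by two tiny records.
printf '%s\n' \
  'header line' \
  'LOCUS       AB000001' \
  'ACCESSION   AB000001' \
  'ORIGIN' \
  '//' \
  'LOCUS       KT156259' \
  'ACCESSION   KT156259' \
  '//' | locate_records 'CONTENT_ID'
# prints:
# AB000001 line:CONTENT_ID!/L2-L5
# KT156259 line:CONTENT_ID!/L6-L8
```

For a gzipped package, the stream would presumably come from something like gunzip -c on the package, with a gz:hash://sha256/... identifier in place of CONTENT_ID.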