eweitz closed this issue 8 years ago
Any idea on the estimated time for this to be completed? I can also try and help if I can get caught up to speed.
I would like for Ideogram.js to have basic support for all eukaryotes that have suitable data before August.
Any help would be appreciated!
Development can be divided into two tasks: data retrieval and rendering. If you want to help, @ProjectProgramAMark, I would recommend trying the data retrieval task. I'll take care of rendering.
Given an organism's scientific name, get a list of chromosomes in its genome and their length in nucleotide base pairs. Each chromosome's length in base pairs (bp) is proportional to its length in pixels (px) after rendering: chrLength(bp) ~ chrLength(px).
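As a rough illustration of the bp-to-px proportionality, here is a minimal sketch. The helper name scaleToPixels and the maxPx parameter are hypothetical and not part of Ideogram.js; the sketch simply assumes the longest chromosome should span maxPx pixels.

```javascript
// Hypothetical helper (not Ideogram's API): scale lengths in base pairs to
// pixels so that the longest chromosome spans maxPx pixels.
function scaleToPixels(chromosomes, maxPx) {
  var lengths = chromosomes.map(function(chr) { return chr.length; });
  var maxBp = Math.max.apply(null, lengths);
  return chromosomes.map(function(chr) {
    return {
      name: chr.name,
      lengthBp: chr.length,
      lengthPx: (chr.length / maxBp) * maxPx
    };
  });
}

// Lengths below are for Plasmodium falciparum chromosomes 1 and 14
var scaled = scaleToPixels(
  [{name: '1', length: 643292}, {name: '14', length: 3291871}],
  400
);
console.log(scaled);
```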
Implement the data retrieval using D3's xhr module such that no server-side code is required by developers using this library feature.
A draft dataflow for Plasmodium falciparum is below. Details are of course likely to change, but I think the gist below will work. If you would like to help on this, I would recommend implementing a function for this in JavaScript and D3 outside Ideogram.js before integrating it into the library.
We want to find the best genome assembly for the input organism. To accomplish this, query NCBI Assembly database via EUtils esearch.
(The term value will likely be refined over time, but this is a decent start.)
{
"header": {
"type": "esearch",
"version": "0.3"
},
"esearchresult": {
"count": "1",
"retmax": "1",
"retstart": "0",
"idlist": [
"360518"
],
...
Parse the first element from the idlist key of the esearch JSON response, e.g. 360518.
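A minimal sketch of this step might look like the following. The function names are hypothetical, and the term value built here (organism name plus an [Organism] field tag) is only a placeholder assumption, not the query used in this thread. The db, term and retmode parameters are standard EUtils esearch parameters.

```javascript
var EUTILS = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

// Build an esearch URL against the Assembly database.
// ASSUMPTION: the term below is a placeholder and would need refining.
function buildEsearchUrl(organismName) {
  var term = organismName + '[Organism]';
  return EUTILS + 'esearch.fcgi?db=assembly&retmode=json&term=' +
    encodeURIComponent(term);
}

// Return the first UID in the idlist of an esearch JSON response
function parseFirstAssemblyUid(esearchJson) {
  var idlist = esearchJson.esearchresult.idlist;
  if (!idlist || idlist.length === 0) {
    throw new Error('No assembly found');
  }
  return idlist[0];
}

// Response trimmed from the example in this comment
var sampleEsearch = {
  header: {type: 'esearch', version: '0.3'},
  esearchresult: {count: '1', retmax: '1', retstart: '0', idlist: ['360518']}
};

var esearchUrl = buildEsearchUrl('Plasmodium falciparum');
var assemblyUid = parseFirstAssemblyUid(sampleEsearch);
console.log(esearchUrl, assemblyUid);
```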
Resolve that internal identifier to a public identifier -- the assembly's RefSeq accession -- via EUtils esummary as follows.
{
"header": {
"type": "esummary",
"version": "0.3"
},
"result": {
"uids": [
"360518"
],
"360518": {
"uid": "360518",
"rsuid": "360518",
"gbuid": "256198",
"assemblyaccession": "GCF_000002765.3",
"lastmajorreleaseaccession": "GCF_000002765.3",
"chainid": "2765",
"assemblyname": "ASM276v1",
Parse the value of assemblyaccession from the esummary JSON response above, e.g. GCF_000002765.3.
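This resolution step could be sketched roughly as below. The function names are hypothetical; the sample object is trimmed from the esummary response shown above.

```javascript
// Build an esummary URL for one Assembly UID (standard EUtils parameters)
function buildAssemblySummaryUrl(uid) {
  return 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi' +
    '?db=assembly&retmode=json&id=' + uid;
}

// Read the RefSeq accession of the first assembly in an esummary response
function parseAssemblyAccession(esummaryJson) {
  var result = esummaryJson.result;
  var uid = result.uids[0];
  return result[uid].assemblyaccession;
}

var sampleEsummary = {
  header: {type: 'esummary', version: '0.3'},
  result: {
    uids: ['360518'],
    '360518': {
      uid: '360518',
      rsuid: '360518',
      assemblyaccession: 'GCF_000002765.3',
      assemblyname: 'ASM276v1'
    }
  }
};

var accession = parseAssemblyAccession(sampleEsummary);
console.log(buildAssemblySummaryUrl('360518'), accession);
```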
The RefSeq accession represents the "best" genome assembly for the organism, or, more precisely, an assembly which should have sufficient data for the organism's chromosome complement.
Now that we know the organism's best genome assembly, we can get a list of its chromosomes and their length.
Using the assembly RefSeq accession obtained from the previous step, get its full sequence report.
...
# Sequence-Name Sequence-Role Assigned-Molecule Assigned-Molecule-Location/Type GenBank-Accn Relationship RefSeq-Accn Assembly-Unit Sequence-Length UCSC-style-name
1 assembled-molecule 1 Chromosome AL844501.1 = NC_004325.1 Primary Assembly 643292 na
2 assembled-molecule 2 Chromosome AE001362.1 = NC_000910.2 Primary Assembly 947102 na
3 assembled-molecule 3 Chromosome AL844502.1 = NC_000521.3 Primary Assembly 1060087 na
4 assembled-molecule 4 Chromosome AL844503.1 = NC_004318.1 Primary Assembly 1204112 na
5 assembled-molecule 5 Chromosome AL844504.1 = NC_004326.1 Primary Assembly 1343552 na
6 assembled-molecule 6 Chromosome AL844505.1 = NC_004327.2 Primary Assembly 1418244 na
7 assembled-molecule 7 Chromosome AL844506.2 = NC_004328.2 Primary Assembly 1501717 na
8 assembled-molecule 8 Chromosome AL844507.2 = NC_004329.2 Primary Assembly 1419563 na
9 assembled-molecule 9 Chromosome AL844508.1 = NC_004330.1 Primary Assembly 1541723 na
10 assembled-molecule 10 Chromosome AE014185.2 = NC_004314.2 Primary Assembly 1687655 na
11 assembled-molecule 11 Chromosome AE014186.2 = NC_004315.2 Primary Assembly 2038337 na
12 assembled-molecule 12 Chromosome AE014188.3 = NC_004316.3 Primary Assembly 2271478 na
13 assembled-molecule 13 Chromosome AL844509.2 = NC_004331.2 Primary Assembly 2895605 na
14 assembled-molecule 14 Chromosome AE014187.2 = NC_004317.2 Primary Assembly 3291871 na
MT assembled-molecule MT Mitochondrion na <> NC_002375.1 non-nuclear 5967 na
Here Sequence-Name is the chromosome name and Sequence-Length is the chromosome's length in base pairs. Split those rows by tab, parse the name and length values for each chromosome, and put them into an array of objects as shown in the example output in the following section.
// Implement getChromosomes() function that takes scientific name as an argument
getChromosomes("Plasmodium falciparum")
// Array of objects with basic data on all chromosomes in Plasmodium falciparum
[
{"name": "1", "length": 643292},
{"name": "2", "length": 947102},
...
{"name": "MT", "length": 5967}
]
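The tab-splitting step described above can be sketched as follows. The function name parseSequenceReport is hypothetical. Keeping only rows whose Sequence-Role is assembled-molecule is an assumption on my part, meant to skip unplaced scaffolds in other assemblies; it is not something specified in this thread.

```javascript
// Parse the text of an assembly sequence report into [{name, length}, ...].
// Column 0 is Sequence-Name, column 1 is Sequence-Role, column 8 is
// Sequence-Length. Lines starting with "#" are comments or the header.
function parseSequenceReport(reportText) {
  return reportText
    .split(/\r?\n/)
    .filter(function(line) {
      return line.trim() !== '' && line[0] !== '#';
    })
    .map(function(line) { return line.split('\t'); })
    // ASSUMPTION: only assembled molecules are wanted
    .filter(function(cols) { return cols[1] === 'assembled-molecule'; })
    .map(function(cols) {
      return {name: cols[0], length: parseInt(cols[8], 10)};
    });
}

// Two rows from the report above, with tabs restored
var sampleReport = [
  '# Sequence-Name\tSequence-Role\tAssigned-Molecule\t' +
    'Assigned-Molecule-Location/Type\tGenBank-Accn\tRelationship\t' +
    'RefSeq-Accn\tAssembly-Unit\tSequence-Length\tUCSC-style-name',
  '1\tassembled-molecule\t1\tChromosome\tAL844501.1\t=\t' +
    'NC_004325.1\tPrimary Assembly\t643292\tna',
  'MT\tassembled-molecule\tMT\tMitochondrion\tna\t<>\t' +
    'NC_002375.1\tnon-nuclear\t5967\tna'
].join('\n');

var reportChromosomes = parseSequenceReport(sampleReport);
console.log(reportChromosomes);
```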
@eweitz, I'm having a bit of trouble with sending the request to get the full sequence report using the assembly RefSeq accession (for ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000002765.3.assembly.txt). It's returning the following error:
XMLHttpRequest cannot load https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000002765.3.assembly.txt. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://localhost:3000' is therefore not allowed access.
I'm pretty sure this is a CORS problem, but I'm not sure how to get around that using only d3js. I'm running the test environment off of a NodeJS server, and am receiving the same error when I run it on Apache. I have uploaded a repo of my test environment here.
@ProjectProgramAMark , I sent you this pull request, with some responses.
Thanks @Klortho. The only thing I'm unsure about is that @eweitz specified he didn't want any server code being used in this feature, so I wasn't sure if getting around CORS was only a temporary fix for my problem and didn't serve the bigger picture. I went ahead and merged your pull request though.
Oh, right. Well, this is purely in the transport layer -- nothing to do with D3.
@ProjectProgramAMark, can you try the following workaround? It required much sleuthing to determine, but the method described below gets all data from EUtils, and thus should avoid the CORS issue.
I quickly checked via browsing EUtils API results that the following approach using the little-known GenColl database works straightforwardly not only for Plasmodium falciparum, but also for Homo sapiens and Drosophila melanogaster, unlike several other approaches I tried with the better-known databases Assembly and BioProject.
Parse the value of rsuid from the esummary JSON response, e.g. 360518, in the "Get best genome for organism" section of my previous comment.
(Data recap: the rsuid 360518 is the internal RefSeq UID for the RefSeq genome assembly GCF_000002765.3, i.e. ASM276v1, the latest chromosome-level RefSeq assembly for organism Plasmodium falciparum.)
Get a list of chromosome UIDs linked to Nucleotide (nuccore) from GenColl database for genome assembly 360518:
{
"header": {
"type": "elink",
"version": "0.3"
},
"linksets": [
{
"dbfrom": "pubmed",
"ids": [
360518
],
"linksetdbs": [
{
"dbto": "nuccore",
"linkname": "gencoll_nuccore_chr",
"links": [
296005645,
296005143,
296004920,
258549241,
258549170,
258549151,
258549100,
86176855,
23957709,
23613523,
23613362,
23613028,
23593254,
23509994,
11466244
]
}
]
}
]
}
Parse links, and join the elements of that array into a comma-delimited string (e.g. ids = links.join(",")).
Pass that string of chromosome UIDs into the id parameter of an ESummary call to the Nucleotide database.
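The join-and-pass step can be sketched roughly like this. The function name is hypothetical, and the sample response is trimmed to three of the links shown above.

```javascript
// Turn an elink JSON response into an esummary URL for the Nucleotide database
function buildChromosomeSummaryUrl(elinkJson) {
  var links = elinkJson.linksets[0].linksetdbs[0].links;
  var ids = links.join(',');
  return 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi' +
    '?db=nuccore&retmode=json&id=' + ids;
}

var sampleElink = {
  header: {type: 'elink', version: '0.3'},
  linksets: [{
    ids: [360518],
    linksetdbs: [{
      dbto: 'nuccore',
      linkname: 'gencoll_nuccore_chr',
      links: [296005645, 296005143, 11466244]
    }]
  }]
};

var summaryUrl = buildChromosomeSummaryUrl(sampleElink);
console.log(summaryUrl);
```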
...
"result": {
"uids": [
"296005645",
"296005143",
"296004920",
"258549241",
"258549170",
"258549151",
"258549100",
"86176855",
"23957709",
"23613523",
"23613362",
"23613028",
"23593254",
"23509994",
"11466244"
],
"296005645": {
"uid": "296005645",
"caption": "NC_004331",
"title": "Plasmodium falciparum 3D7 chromosome 13",
"extra": "gi|296005645|ref|NC_004331.2||gnl|NCBI_GENOMES|103",
"gi": 296005645,
"createdate": "2002/10/03",
"updatedate": "2010/07/29",
"flags": 512,
"taxid": 36329,
"slen": 2895605,
"biomol": "genomic",
"moltype": "dna",
"topology": "linear",
"sourcedb": "refseq",
"segsetsize": "",
"projectid": "148",
"genome": "chromosome",
"subtype": "chromosome",
"subname": "13",
"assemblygi": 225631926,
"assemblyacc": "AL844509",
"tech": "",
"completeness": "",
"geneticcode": "1",
"strand": "",
"organism": "Plasmodium falciparum 3D7",
"strain": "",
"statistics": [
{
"type": "Length",
"count": 2895605
Here subname is the chromosome name and slen is the chromosome's length in base pairs. Iterate over each key-value pair in result (skip uids), parse name and length values for each chromosome, and put them into an array of objects as shown in the "Example output" section in my previous comment.
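A minimal sketch of that iteration is below; the function name is hypothetical. One design choice to flag: rather than looping over the keys of result directly, this walks the uids array and looks each UID up. The set of entries is the same, but JavaScript engines enumerate integer-like object keys in ascending numeric order, so using uids should preserve the order the API returned.

```javascript
// Convert a Nucleotide esummary response into [{name, length}, ...]
function parseChromosomes(esummaryJson) {
  var result = esummaryJson.result;
  return result.uids.map(function(uid) {
    var chr = result[uid];
    return {name: chr.subname, length: chr.slen};
  });
}

// Trimmed from the response above (chromosome 13 only)
var sampleNuccore = {
  result: {
    uids: ['296005645'],
    '296005645': {
      uid: '296005645',
      caption: 'NC_004331',
      title: 'Plasmodium falciparum 3D7 chromosome 13',
      slen: 2895605,
      genome: 'chromosome',
      subname: '13'
    }
  }
};

var summaryChromosomes = parseChromosomes(sampleNuccore);
console.log(summaryChromosomes);
```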
Notes:
subname may require additional parsing to get the canonical, human-friendly chromosome name. Example: Drosophila chromosomes -- the expected name of the first chromosome result there is "3L", but its subname value has lots of noise.
Thanks for the pull request, @Klortho, but @ProjectProgramAMark is correct: I would really prefer this feature to require no server-side code or configuration. This feature should work with a primitive, traditional web server stack, e.g. on a static web page served by Apache or Nginx.
I wonder if the lack of an Access-Control-Allow-Origin HTTP header in https://ftp.ncbi.nlm.nih.gov is a matter of security, or if it's something that simply has not been implemented yet. I suspect it's the latter. Given the proliferation of client-side API calls and unique information available via FTP -- e.g. full sequence reports like https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000002765.3.assembly.txt are the only machine-friendly resources I know that provide expected chromosome ordering, and conveniently include chromosome name and length -- I think eliminating the CORS restriction on NCBI's side would be widely beneficial.
@ProjectProgramAMark, if you are still interested in this, please let me know. Otherwise I will begin wiring up data retrieval for this in about a week.
@eweitz Where would be the ideal place/file to place this script in?
@ProjectProgramAMark, what you have in d3test/public/index.html is a good start. I recommend continuing that path, perhaps in a fork of this repository. Once you can return names and lengths for chromosomes given only a scientific name like Plasmodium falciparum -- i.e. without hardcoded intermediate values like rsuid "360518" -- ping me and we'll proceed from there.
Ideally, your data retrieval function will ultimately be a single method getChromosomes(organismName) in src/js/ideogram.js, e.g.:
/**
* Returns names and lengths of chromosomes for an organism's best-known genome assembly
*/
Ideogram.prototype.getChromosomes = function(organismName) {
// Your data retrieval code
}
But let's take this one step at a time. First get a working independent function, then we'll integrate this into src/js/ideogram.js. Integrating will require getting familiar with Ideogram's complex initialization logic, which I can help with when we get there.
Once you have a standalone function getChromosomes(organismName) working without hardcoded values and returning something like the example output from my 6/29 comment, please let me know.
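For orientation, a standalone function without hardcoded intermediate values might chain the four EUtils calls roughly as sketched below. Several things here are assumptions rather than anything settled in this thread: the fetchJson parameter is a hypothetical injected function that returns a Promise of parsed JSON (so the sketch is not tied to D3 or any one HTTP client), the term value is a placeholder, and dbfrom=gencoll is my reading of "from GenColl database".

```javascript
var EUTILS_BASE = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

// Sketch only. fetchJson(url) must return a Promise of parsed JSON.
function getChromosomes(organismName, fetchJson) {
  var json = '&retmode=json';
  // ASSUMPTION: placeholder search term, likely to need refinement
  var term = encodeURIComponent(organismName + '[Organism]');

  return fetchJson(EUTILS_BASE + 'esearch.fcgi?db=assembly&term=' + term + json)
    .then(function(esearch) {
      // Step 1: internal UID of the best assembly
      var uid = esearch.esearchresult.idlist[0];
      return fetchJson(EUTILS_BASE + 'esummary.fcgi?db=assembly&id=' + uid + json);
    })
    .then(function(esummary) {
      // Step 2: RefSeq UID of that assembly
      var rsuid = esummary.result[esummary.result.uids[0]].rsuid;
      // ASSUMPTION: dbfrom=gencoll selects the GenColl database
      return fetchJson(EUTILS_BASE + 'elink.fcgi?dbfrom=gencoll&db=nuccore' +
        '&linkname=gencoll_nuccore_chr&id=' + rsuid + json);
    })
    .then(function(elink) {
      // Step 3: chromosome UIDs in the Nucleotide database
      var ids = elink.linksets[0].linksetdbs[0].links.join(',');
      return fetchJson(EUTILS_BASE + 'esummary.fcgi?db=nuccore&id=' + ids + json);
    })
    .then(function(nuccore) {
      // Step 4: name and length of each chromosome
      return nuccore.result.uids.map(function(uid) {
        var chr = nuccore.result[uid];
        return {name: chr.subname, length: chr.slen};
      });
    });
}

// Possible browser usage (not executed here):
// getChromosomes('Plasmodium falciparum', function(url) {
//   return fetch(url).then(function(response) { return response.json(); });
// }).then(function(chromosomes) { console.log(chromosomes); });
```

Injecting fetchJson also makes it straightforward to exercise the chain with canned responses before pointing it at the live API.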
@eweitz my script should be working now. pull my repo and cd into the directory, run "npm start" and click on the button.
@ProjectProgramAMark, your data retrieval code looks good so far! Thanks for those instructions. I verified that your current getChromosomes(organismName) function gets chromosome lengths and names for Plasmodium falciparum.
I think the next step is to begin integrating your function into this library, i.e. src/js/ideogram.js.
Please:
Create a branch named eukaryotes-data-retrieval in your fork of this repository.
In src/js/ideogram.js, add getChromosomes as an instance method of Ideogram; see outline in my previous comment.
Open a pull request from your ideogram repo's eukaryotes-data-retrieval branch into this repo's master branch.
If you're feeling ambitious, try calling ideogram.getChromosomes("Plasmodium falciparum") after you install your method on Ideogram's prototype. But don't worry if it doesn't work, or if Ideogram is giving you trouble installing. The main goal is to get a pull request with more or less your current getChromosomes function open, and begin a code review. I can take care of any integration trouble.
In the code review, I'll recommend and make various code updates. Your current function will undergo some transformations, but its essence looks roughly OK at an initial glance.
@eweitz ok done! Let me know what you want to change.
@eweitz any time estimation on when this will be finished?
I hope to have a basic version of this available in a week or two.
@eweitz any progress?
@ProjectProgramAMark, yes, slowly but steadily. Integrating the data retrieval code into the larger Ideogram library turned out to require major work that cut across more aspects of Ideogram than expected. See #54 for details.
I'm now beginning on the rendering task of this feature. I will ping you when it's done.
Update: @ProjectProgramAMark, this is done -- see e.g. https://eweitz.github.io/ideogram/eukaryotes.html?org=plasmodium-falciparum. Thanks again for your help!
@mrouard, @ProjectProgramAMark, rough basic rendering of some arbitrary eukaryotic genomes is in place in the development branch https://github.com/eweitz/ideogram/tree/render-eukaryotic-chromosomes.
That branch can retrieve chromosome data for e.g. Plasmodium falciparum (malaria parasite), Caenorhabditis elegans (worm) and Musa acuminata (banana) given only the organism's scientific name. See examples/worm.html in the render-eukaryotic-chromosomes branch for an example of how this feature will look at the app-developer level.
I'll comment here with a progress update within a week.
The rendering of eukaryotic chromosomes is significantly better than it was last week. I've also replaced the worm.html example with something more expansive, examples/eukaryotes.html.
My next task is to fix the failing automated test suite. As before, the place to follow progress is https://github.com/eweitz/ideogram/tree/render-eukaryotic-chromosomes.
This feature is done. Support for eukaryotes can be found at:
Looks great @eweitz
Looking at some examples, there seems to be a display bug: https://eweitz.github.io/ideogram/eukaryotes.html?org=arabidopsis-thaliana
There is a very small additional chromosome. The same happens for maize, rice and grape.
Thanks for noting that problem, @mrouard. I've opened issue #56 to address it.
Develop support for basic depictions for the chromosome complement of all eukaryotes. Integrate a third-party web API to retrieve chromosome count and length data for arbitrary taxa.