eweitz commented 8 years ago

Develop support for basic depictions for the chromosome complement of all eukaryotes. Integrate a third-party web API to retrieve chromosome count and length data for arbitrary taxa.

ProjectProgramAMark commented 8 years ago

Any idea on the estimated time for this to be completed? I can also try and help if I can get caught up to speed.

eweitz commented 8 years ago

I would like for Ideogram.js to have basic support for all eukaryotes that have suitable data before August.

Any help would be appreciated!

Development can be divided into two tasks: data retrieval and rendering. If you want to help, @ProjectProgramAMark, I would recommend trying the data retrieval task. I'll take care of rendering.

Data retrieval

Given an organism's scientific name, get a list of chromosomes in its genome and their length in nucleotide base pairs. Each chromosome's length in base pairs (bp) is proportional to its length in pixels (px) after rendering: chrLength(bp) ~ chrLength(px).
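
The proportionality between base pairs and pixels can be sketched as a plain linear scale. This is only an illustration: the name scaleToPixels and the maxPx parameter are hypothetical and not part of Ideogram.js, and it assumes the longest chromosome should be drawn maxPx pixels long.

```javascript
// Hypothetical helper (not part of Ideogram.js): add a pixel length to each
// chromosome so that the longest chromosome is drawn maxPx pixels long.
function scaleToPixels(chromosomes, maxPx) {
  var maxBp = Math.max.apply(null, chromosomes.map(function (chr) {
    return chr.length;
  }));
  return chromosomes.map(function (chr) {
    return {
      name: chr.name,
      length: chr.length,
      px: maxPx * chr.length / maxBp // chrLength(px) is proportional to chrLength(bp)
    };
  });
}

// Chromosomes 1 and 14 of Plasmodium falciparum, scaled to at most 500 px
var scaled = scaleToPixels([
  {name: "1", length: 643292},
  {name: "14", length: 3291871}
], 500);
```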

Implement the data retrieval using D3's xhr module such that no server-side code is required by developers using this library feature.

A draft dataflow for Plasmodium falciparum is below. Details are of course likely to change, but I think the gist below will work. If you would like to help on this, I would recommend implementing a function for this in JavaScript and D3 outside Ideogram.js before integrating it into the library.

Get best genome for organism

We want to find the best genome assembly for the input organism. To accomplish this, query NCBI Assembly database via EUtils esearch.

Request: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=assembly&retmode=json&term=Plasmodium%20falciparum%20AND%20(%22latest%20refseq%22[filter])%20AND%20%22chromosome%20level%22[filter]) (The term value will likely be refined over time, but this is a decent start.)
Response:

{
    "header": {
        "type": "esearch",
        "version": "0.3"
    },
    "esearchresult": {
        "count": "1",
        "retmax": "1",
        "retstart": "0",
        "idlist": [
            "360518"
        ],
    ...

Parse the first element of the idlist array in the esearch JSON response, e.g. 360518.
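
A minimal sketch of this step, kept free of network calls so it can be tried in isolation. The function names are hypothetical. The search term balances its parentheses, which is an assumption about the intended query, and encodeURIComponent also percent-encodes the square brackets around filter.

```javascript
var EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/";

// Build the esearch URL that looks for the latest chromosome-level RefSeq
// assembly of an organism, given its scientific name.
function buildAssemblySearchUrl(organismName) {
  var term = organismName +
    ' AND ("latest refseq"[filter]) AND "chromosome level"[filter]';
  return EUTILS_BASE + "esearch.fcgi?db=assembly&retmode=json&term=" +
    encodeURIComponent(term);
}

// Return the first assembly UID in an esearch JSON response, or null when
// the search found nothing.
function parseFirstAssemblyUid(esearchJson) {
  var idlist = esearchJson.esearchresult.idlist;
  return idlist.length > 0 ? idlist[0] : null;
}

var searchUrl = buildAssemblySearchUrl("Plasmodium falciparum");
```

Returning null for an empty idlist lets the caller decide how to report an organism that has no chromosome-level RefSeq assembly.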

Resolve that internal identifier to a public identifier -- the assembly's RefSeq accession -- via EUtils esummary as follows.

Request: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=assembly&retmode=json&id=360518
Response:

{
    "header": {
        "type": "esummary",
        "version": "0.3"
    },
    "result": {
        "uids": [
            "360518"
        ],
        "360518": {
            "uid": "360518",
            "rsuid": "360518",
            "gbuid": "256198",
            "assemblyaccession": "GCF_000002765.3",
            "lastmajorreleaseaccession": "GCF_000002765.3",
            "chainid": "2765",
            "assemblyname": "ASM276v1",

Parse the value of assemblyaccession from the esummary JSON response, e.g. GCF_000002765.3 above.

The RefSeq accession represents the "best" genome assembly for the organism, or, more precisely, an assembly which should have sufficient data for the organism's chromosome complement.
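
Resolving the internal UID to the accession could look roughly like this. Both function names are hypothetical, and the sample object contains only the fields from the esummary response above that the code reads.

```javascript
// Build the esummary URL for an assembly UID returned by esearch.
function buildAssemblySummaryUrl(uid) {
  return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi" +
    "?db=assembly&retmode=json&id=" + encodeURIComponent(uid);
}

// Read the RefSeq assembly accession for the first UID in the response.
function parseAssemblyAccession(esummaryJson) {
  var uid = esummaryJson.result.uids[0];
  return esummaryJson.result[uid].assemblyaccession;
}

var summaryUrl = buildAssemblySummaryUrl("360518");
var accession = parseAssemblyAccession({
  result: {
    uids: ["360518"],
    "360518": {uid: "360518", assemblyaccession: "GCF_000002765.3"}
  }
});
```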

Get chromosomes for genome

Now that we know the organism's best genome assembly, we can get a list of its chromosomes and their length.

Using the assembly RefSeq accession obtained from the previous step, get its full sequence report.

Request: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000002765.3.assembly.txt
Response:

...
# Sequence-Name Sequence-Role   Assigned-Molecule   Assigned-Molecule-Location/Type GenBank-Accn    Relationship    RefSeq-Accn Assembly-Unit   Sequence-Length UCSC-style-name
1   assembled-molecule  1   Chromosome  AL844501.1  =   NC_004325.1 Primary Assembly    643292  na
2   assembled-molecule  2   Chromosome  AE001362.1  =   NC_000910.2 Primary Assembly    947102  na
3   assembled-molecule  3   Chromosome  AL844502.1  =   NC_000521.3 Primary Assembly    1060087 na
4   assembled-molecule  4   Chromosome  AL844503.1  =   NC_004318.1 Primary Assembly    1204112 na
5   assembled-molecule  5   Chromosome  AL844504.1  =   NC_004326.1 Primary Assembly    1343552 na
6   assembled-molecule  6   Chromosome  AL844505.1  =   NC_004327.2 Primary Assembly    1418244 na
7   assembled-molecule  7   Chromosome  AL844506.2  =   NC_004328.2 Primary Assembly    1501717 na
8   assembled-molecule  8   Chromosome  AL844507.2  =   NC_004329.2 Primary Assembly    1419563 na
9   assembled-molecule  9   Chromosome  AL844508.1  =   NC_004330.1 Primary Assembly    1541723 na
10  assembled-molecule  10  Chromosome  AE014185.2  =   NC_004314.2 Primary Assembly    1687655 na
11  assembled-molecule  11  Chromosome  AE014186.2  =   NC_004315.2 Primary Assembly    2038337 na
12  assembled-molecule  12  Chromosome  AE014188.3  =   NC_004316.3 Primary Assembly    2271478 na
13  assembled-molecule  13  Chromosome  AL844509.2  =   NC_004331.2 Primary Assembly    2895605 na
14  assembled-molecule  14  Chromosome  AE014187.2  =   NC_004317.2 Primary Assembly    3291871 na
MT  assembled-molecule  MT  Mitochondrion   na  <>  NC_002375.1 non-nuclear 5967    na

Here Sequence-Name is the chromosome name and Sequence-Length is the chromosome's length in base pairs. Split those rows by tab, parse the name and length values for each chromosome, and put them into an array of objects as shown in the example output in the following section.
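
A sketch of that parsing step, assuming the report text is already in hand as a string. The function name is hypothetical. Keeping only rows whose Sequence-Role is assembled-molecule is an added assumption, meant to skip unplaced scaffolds in assemblies that list them.

```javascript
// Parse an NCBI assembly report (tab-delimited text) into
// [{name: ..., length: ...}, ...].
function parseAssemblyReport(reportText) {
  var chromosomes = [];
  reportText.split("\n").forEach(function (rawLine) {
    var line = rawLine.replace(/\r$/, ""); // tolerate Windows line endings
    // Skip blank lines and "#" comment lines, which include the column header
    if (line === "" || line.charAt(0) === "#") return;
    var columns = line.split("\t");
    // Column 0: Sequence-Name, column 1: Sequence-Role, column 8: Sequence-Length
    if (columns[1] !== "assembled-molecule") return; // assumption, see lead-in
    chromosomes.push({name: columns[0], length: parseInt(columns[8], 10)});
  });
  return chromosomes;
}
```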

Example input and output

Input:

// Implement getChromosomes() function that takes scientific name as an argument
getChromosomes("Plasmodium falciparum")

Output:

// Array of objects with basic data on all chromosomes in Plasmodium falciparum
[
  {"name": "1", "length": 643292},
  {"name": "2", "length": 947102},
  ...
  {"name": "MT", "length": 5967}
]

ProjectProgramAMark commented 8 years ago

@eweitz, I'm having a bit of trouble with sending the request to get the full sequence report using the assembly RefSeq accession (for ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000002765.3.assembly.txt). It's returning the following error:

XMLHttpRequest cannot load https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000002765.3.assembly.txt. No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://localhost:3000' is therefore not allowed access.

I'm pretty sure this is a CORS problem, but I'm not sure how to get around that using only d3js. I'm running the test environment off of a NodeJS server, and am receiving the same error when I run it on Apache. I have uploaded a repo of my test environment here.

Klortho commented 8 years ago

@ProjectProgramAMark , I sent you this pull request, with some responses.

ProjectProgramAMark commented 8 years ago

Thanks @Klortho. The only thing I'm unsure about is that @eweitz specified he didn't want any server code used in this feature, so I wasn't sure whether getting around CORS was only a temporary fix for my problem that didn't serve the bigger picture. I went ahead and merged your pull request though.

Klortho commented 8 years ago

Oh, right. Well, this is purely in the transport layer -- nothing to do with D3.

eweitz commented 8 years ago

@ProjectProgramAMark, can you try the following workaround? It required much sleuthing to determine, but the method described below gets all data from EUtils, and thus should avoid the CORS issue.

I quickly checked via browsing EUtils API results that the following approach using the little-known GenColl database works straightforwardly not only for Plasmodium falciparum, but also for Homo sapiens and Drosophila melanogaster, unlike several other approaches I tried with the better-known databases Assembly and BioProject.

Get chromosomes for genome, CORS workaround

Parse the value of rsuid from the esummary JSON response, e.g. 360518 in the "Get best genome for organism" section of my previous comment.

(Data recap: the rsuid 360518 is the internal RefSeq UID for the RefSeq genome assembly GCF_000002765.3, i.e. ASM276v1, the latest chromosome-level RefSeq assembly for organism Plasmodium falciparum.)

Get a list of chromosome UIDs linked to Nucleotide (nuccore) from GenColl database for genome assembly 360518:

{
    "header": {
        "type": "elink",
        "version": "0.3"
    },
    "linksets": [
        {
            "dbfrom": "pubmed",
            "ids": [
                360518
            ],
            "linksetdbs": [
                {
                    "dbto": "nuccore",
                    "linkname": "gencoll_nuccore_chr",
                    "links": [
                        296005645,
                        296005143,
                        296004920,
                        258549241,
                        258549170,
                        258549151,
                        258549100,
                        86176855,
                        23957709,
                        23613523,
                        23613362,
                        23613028,
                        23593254,
                        23509994,
                        11466244
                    ]
                }
            ]
        }
    ]
}

Parse links, and join the elements of that array into a comma-delimited string (e.g. ids = links.join(",")).

Pass that string of chromosome UIDs into the id parameter of an ESummary call to the Nucleotide database.

Request: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?retmode=json&db=nucleotide&id=296005645,296005143,296004920,258549241,258549170,258549151,258549100,86176855,23957709,23613523,23613362,23613028,23593254,23509994,11466244

...
"result": {
        "uids": [
            "296005645",
            "296005143",
            "296004920",
            "258549241",
            "258549170",
            "258549151",
            "258549100",
            "86176855",
            "23957709",
            "23613523",
            "23613362",
            "23613028",
            "23593254",
            "23509994",
            "11466244"
        ],
        "296005645": {
            "uid": "296005645",
            "caption": "NC_004331",
            "title": "Plasmodium falciparum 3D7 chromosome 13",
            "extra": "gi|296005645|ref|NC_004331.2||gnl|NCBI_GENOMES|103",
            "gi": 296005645,
            "createdate": "2002/10/03",
            "updatedate": "2010/07/29",
            "flags": 512,
            "taxid": 36329,
            "slen": 2895605,
            "biomol": "genomic",
            "moltype": "dna",
            "topology": "linear",
            "sourcedb": "refseq",
            "segsetsize": "",
            "projectid": "148",
            "genome": "chromosome",
            "subtype": "chromosome",
            "subname": "13",
            "assemblygi": 225631926,
            "assemblyacc": "AL844509",
            "tech": "",
            "completeness": "",
            "geneticcode": "1",
            "strand": "",
            "organism": "Plasmodium falciparum 3D7",
            "strain": "",
            "statistics": [
                {
                    "type": "Length",
                    "count": 2895605

Here subname is the chromosome name and slen is the chromosome's length in base pairs. Iterate over each key-value pair in result (skip uids), parse the name and length values for each chromosome, and put them into an array of objects as shown in the "Example input and output" section in my previous comment.
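
The last two steps of the workaround can be sketched as follows. The function names are hypothetical. The parser reads the subname and slen fields shown in the response above, and it walks the uids array rather than the object's keys, because JavaScript enumerates integer-like keys in ascending numeric order and would otherwise lose the order of the response.

```javascript
// Join the chromosome UIDs from an elink JSON response and build the
// ESummary URL for the Nucleotide database.
function buildChromosomeSummaryUrl(elinkJson) {
  var links = elinkJson.linksets[0].linksetdbs[0].links;
  var ids = links.join(",");
  return "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi" +
    "?retmode=json&db=nucleotide&id=" + ids;
}

// Convert the ESummary result into [{name: ..., length: ...}, ...],
// one object per UID, in the order the UIDs are listed.
function parseChromosomeSummaries(esummaryJson) {
  var result = esummaryJson.result;
  return result.uids.map(function (uid) {
    var record = result[uid];
    return {name: record.subname, length: record.slen};
  });
}
```

For the record shown above, parseChromosomeSummaries would yield an entry with name "13" and length 2895605.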

Notes:

subname may require additional parsing to get the canonical, human-friendly chromosome name. Example: Drosophila chromosomes -- the expected name of the first chromosome result there is "3L", but its subname value has lots of noise.
Don't worry about chromosome order or MT (mitochondrial DNA). I can take care of chromosome order unless you are especially interested. Drosophila chromosomes are an example again of a complication -- they are ordered X, 2L, 2R, 3L, 3R, 4, Y, MT. C. elegans chromosomes are ordered I, II, III, IV, X, MT.
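
One possible way to handle ordering, shown only as a sketch: sort against an explicit, caller-supplied list of expected names. The function name is hypothetical, and this is not necessarily how Ideogram.js orders chromosomes. The Drosophila order is the one given in the note above.

```javascript
// Hypothetical helper: order chromosomes by a list of expected names.
// Chromosomes whose name is not in the list go last, in their input order.
function orderChromosomes(chromosomes, expectedOrder) {
  function rank(chr) {
    var index = expectedOrder.indexOf(chr.name);
    return index === -1 ? expectedOrder.length : index;
  }
  return chromosomes
    .map(function (chr, i) { return {chr: chr, i: i}; })
    // Compare by rank first; fall back to input position to keep ties stable
    .sort(function (a, b) { return rank(a.chr) - rank(b.chr) || a.i - b.i; })
    .map(function (entry) { return entry.chr; });
}

var drosophilaOrder = ["X", "2L", "2R", "3L", "3R", "4", "Y", "MT"];
var ordered = orderChromosomes(
  [{name: "3L"}, {name: "MT"}, {name: "X"}, {name: "2R"}],
  drosophilaOrder
);
```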

eweitz commented 8 years ago

Thanks for the pull request, @Klortho, but @ProjectProgramAMark is correct: I would really prefer this feature to require no server-side code or configuration. This feature should work with a primitive, traditional web server stack, e.g. on a static web page served by Apache or Nginx.

I wonder if the lack of an Access-Control-Allow-Origin HTTP header in https://ftp.ncbi.nlm.nih.gov is a matter of security, or if it's something that simply has not been implemented yet. I suspect it's the latter. Given the proliferation of client-side API calls and unique information available via FTP -- e.g. full sequence reports like https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/GCF_000002765.3.assembly.txt are the only machine-friendly resources I know that provide expected chromosome ordering, and conveniently include chromosome name and length -- I think eliminating the CORS restriction on NCBI's side would be widely beneficial.

eweitz commented 8 years ago

@ProjectProgramAMark, if you are still interested in this, please let me know. Otherwise I will begin wiring up data retrieval for this in about a week.

ProjectProgramAMark commented 8 years ago

@eweitz Where would be the ideal place/file to place this script in?

eweitz commented 8 years ago

@ProjectProgramAMark, what you have in d3test/public/index.html is a good start. I recommend continuing that path, perhaps in a fork of this repository. Once you can return names and lengths for chromosomes given only a scientific name like Plasmodium falciparum -- i.e. without hardcoded intermediate values like rsuid "360518" -- ping me and we'll proceed from there.

Ideally, your data retrieval function will ultimately be a single method getChromosomes(organismName) in src/js/ideogram.js, e.g.:

/**
* Returns names and lengths of chromosomes for an organism's best-known genome assembly
*/
Ideogram.prototype.getChromosomes = function(organismName) {
  // Your data retrieval code
}

But let's take this one step at a time. First get a working independent function, then we'll integrate this into src/js/ideogram.js. Integrating will require getting familiar with Ideogram's complex initialization logic, which I can help with when we get there.

Once you have a standalone function getChromosomes(organismName) working without hardcoded values and returning something like the example output from my 6/29 comment, please let me know.
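
A rough sketch of such a standalone function, chaining the esearch, esummary and sequence report requests from the first dataflow in this thread, since all of those request URLs are given above. The extra parameters are assumptions: requestJson(url, callback) and requestText(url, callback) are hypothetical caller-supplied functions that invoke callback(error, data), in the style of the callback form of d3.json and d3.text. The final request is the one reported above as blocked by CORS, so in a browser that step would need to be swapped for the EUtils-based workaround.

```javascript
// Sketch only: get [{name, length}, ...] for an organism without hardcoded
// intermediate values. requestJson and requestText are supplied by the caller.
function getChromosomes(organismName, requestJson, requestText, callback) {
  var eutils = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/";
  var reports = "https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/All/";
  var term = organismName +
    ' AND ("latest refseq"[filter]) AND "chromosome level"[filter]';
  var searchUrl = eutils + "esearch.fcgi?db=assembly&retmode=json&term=" +
    encodeURIComponent(term);

  // Step 1: find the UID of the best assembly for the organism
  requestJson(searchUrl, function (error, search) {
    if (error) return callback(error);
    var uid = search.esearchresult.idlist[0];
    if (uid === undefined) {
      return callback(new Error("No assembly found for " + organismName));
    }

    // Step 2: resolve the UID to a RefSeq assembly accession
    var summaryUrl = eutils + "esummary.fcgi?db=assembly&retmode=json&id=" + uid;
    requestJson(summaryUrl, function (error, summary) {
      if (error) return callback(error);
      var accession = summary.result[uid].assemblyaccession;

      // Step 3: fetch the sequence report; keep name and length from each row
      requestText(reports + accession + ".assembly.txt", function (error, report) {
        if (error) return callback(error);
        var chromosomes = report.split("\n")
          .filter(function (line) {
            return line !== "" && line.charAt(0) !== "#";
          })
          .map(function (line) {
            var columns = line.split("\t");
            return {name: columns[0], length: parseInt(columns[8], 10)};
          });
        callback(null, chromosomes);
      });
    });
  });
}
```

Passing the request functions in keeps the chaining logic independent of the transport, so the same function can be exercised with stubs before it is wired to real requests.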

ProjectProgramAMark commented 8 years ago

@eweitz my script should be working now. Pull my repo, cd into the directory, run "npm start" and click the button.

eweitz commented 8 years ago

@ProjectProgramAMark, your data retrieval code looks good so far! Thanks for those instructions. I verified that your current getChromosomes(organismName) function gets chromosome lengths and names for Plasmodium falciparum.

I think the next step is to begin integrating your function into this library, i.e. src/js/ideogram.js.

Please:

Fork this repo
Create a new branch labeled eukaryotes-data-retrieval
Paste your function into src/js/ideogram.js
Modify your function's signature to add getChromosomes as an instance method of Ideogram; see outline in my previous comment.
Open a pull request to merge your forked ideogram repo's eukaryotes-data-retrieval branch into this repo's master branch

If you're feeling ambitious, try calling ideogram.getChromosomes("Plasmodium falciparum") after you install your method on Ideogram's prototype. But don't worry if it doesn't work, or if Ideogram is giving you trouble installing. The main goal is to get a pull request with more or less your current getChromosomes function open, and begin a code review. I can take care of any integration trouble.

In the code review, I'll recommend and make various code updates. Your current function will undergo some transformations, but its essence looks roughly OK at an initial glance.

ProjectProgramAMark commented 8 years ago

@eweitz ok done! Let me know what you want to change.

ProjectProgramAMark commented 8 years ago

@eweitz any time estimation on when this will be finished?

eweitz commented 8 years ago

I hope to have a basic version of this available in a week or two.

ProjectProgramAMark commented 8 years ago

@eweitz any progress?

eweitz commented 8 years ago

@ProjectProgramAMark, yes, slowly but steadily. Integrating the data retrieval code into the larger Ideogram library turned out to require major work that cut across more aspects of Ideogram than expected. See #54 for details.

I'm now beginning on the rendering task of this feature. I will ping you when it's done.

Update: @ProjectProgramAMark, this is done -- see e.g. https://eweitz.github.io/ideogram/eukaryotes.html?org=plasmodium-falciparum. Thanks again for your help!

eweitz commented 8 years ago

@mrouard, @ProjectProgramAMark, rough basic rendering of some arbitrary eukaryotic genomes is in place in the development branch https://github.com/eweitz/ideogram/tree/render-eukaryotic-chromosomes.

That branch can retrieve chromosome data for e.g. Plasmodium falciparum (malaria parasite), Caenorhabditis elegans (worm) and Musa acuminata (banana) given only the organism's scientific name. See examples/worm.html in the render-eukaryotic-chromosomes branch for an example of how this feature will look at the app-developer level.

I'll comment here with a progress update within a week.

eweitz commented 8 years ago

The rendering of eukaryotic chromosomes is significantly better than it was last week. I've also replaced the worm.html example with something more expansive, examples/eukaryotes.html.

My next task is to fix the failing automated test suite. As before, the place to follow progress is https://github.com/eweitz/ideogram/tree/render-eukaryotic-chromosomes.

eweitz commented 8 years ago

This feature is done. Support for eukaryotes can be found at:

https://eweitz.github.io/ideogram/eukaryotes.html

mrouard commented 8 years ago

Looks great @eweitz

Looking at some examples, there seems to be a display bug: https://eweitz.github.io/ideogram/eukaryotes.html?org=arabidopsis-thaliana

There appears to be a very small additional chromosome. The same happens for maize, rice and grape.

eweitz commented 8 years ago

Thanks for noting that problem, @mrouard. I've opened issue #56 to address it.

eweitz / ideogram

Support all eukaryotes #45
