Test fixtures for GWAS plugin

brinkdp commented 2 months ago

We have gotten confirmation that we will soon have new population genomics data incoming. I took this as encouragement to continue the investigation of the GWAS plugin started by @Kwentine in EB-56. Big thanks for the previous progress on this! I mostly just tweaked the config and prepared new test data.

EDIT: one week later, the data has now been made public. Findings from tests of a representative file from the dataset are presented in the next post below.

The initial commit includes fixtures with BED-like data and an updated version of the EB-56 config. The result is a working Manhattan plot.

The dummy data is based on the structure of an incoming dataset I got to preview. Since that data is yet to be made public, the fixture data contains a randomized score column with a distribution based on the min-max values of the original file with some outliers sprinkled in for effect.
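For reference, a minimal sketch of how a randomized score column along these lines could be generated. This is not the script used for the fixture: the uniform distribution, the outlier fraction, the outlier scale and the function name `randomized_scores` are all assumptions made for illustration.

```python
import random


def randomized_scores(n, lo, hi, outlier_fraction=0.01, outlier_scale=3.0, seed=0):
    """Return n dummy scores in [lo, hi], with occasional outliers above hi.

    lo and hi stand in for the min and max of the real score column.
    The uniform draw and the outlier parameters are illustrative assumptions.
    """
    rng = random.Random(seed)  # seeded so the fixture is reproducible
    span = hi - lo
    scores = []
    for _ in range(n):
        if rng.random() < outlier_fraction:
            # Outlier: placed somewhere above the observed maximum.
            scores.append(rng.uniform(hi, hi + outlier_scale * span))
        else:
            scores.append(rng.uniform(lo, hi))
    return scores
```

Seeding the generator keeps the fixture stable between runs, so that a plot rendered from it can be compared across commits.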

To test (inspired by PR37):

./scripts/dockermake --test SPECIES=gwas_testing_ground
./scripts/dockerserve --dev
cp tests/fixtures/gwas_testing_ground/gwas_randomized_dummy_data.txt.bgz* hugo/static/data
cp -r tests/data/gwas_testing_ground hugo/static/data

And then open: http://localhost:8080/genome-browser/?config=/data/gwas_testing_ground/config.json

Some observations:

First I attempted to load the plain text file using bedAdapter, but this only resulted in an empty plot, which is similar to the result Quentin previously got with the other test data.
Switched to BedTabixAdapter like in the example in the GWAS plugin repo. Realized/remembered that tabix only accepts tab-delimited data ("What's in a name?" :smile: ). Reformatted with awk '{$1=$1}1' OFS='\t' data.txt > data.tab_delim.txt. BED sort, bgzip, tabix. The data now displayed!
Although the official specification for the BED format states that it is whitespace-delimited, tab-delimited BEDs are commonly used and are likely the easiest way to ensure compatibility with JBrowse 2. From now on, we will ask submitting researchers to tab-delimit their BED files.
BED-like files can have headers: the first line needs to start with # and be tab-separated. The column titles can be used to define e.g. the scoreColumn for the plot.
Take-home message: ask researchers to tab-delimit their BED files.
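If the tab-delimiting step ends up in a helper script rather than an awk one-liner, it could look roughly like the sketch below. The function name `tabify` is hypothetical. Like the awk command, it assumes that no field contains internal spaces; unlike awk, it drops blank lines instead of passing them through.

```python
def tabify(lines):
    """Yield each non-blank line with runs of whitespace replaced by single tabs.

    Assumes no field contains internal whitespace, which holds for plain
    numeric BED-like data but not for free-text columns.
    """
    for line in lines:
        fields = line.split()  # splits on any run of spaces or tabs
        if fields:  # skip blank lines
            yield "\t".join(fields)
```

A header written as `#chrom start end score` comes out tab-separated as well. A header with a space after the `#` would get a lone `#` as its first column, so that case would need separate handling. The output still has to be sorted, compressed with bgzip and indexed with tabix before BedTabixAdapter can read it.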

Some things left to consider:

Can BED-like files be used for other visualizations? Manhattan plots will not be suitable for all population genomics datasets. Histograms have already been requested.
Should the GWAS plugin be installed in the repo instead of called remotely?
Will the bedAdapter config work with the tab-delimited data?
The FASTA has a file size of 2 MB despite being truncated. The config can be altered to call on the remote data instead, but a refNameAlias first needs to be prepared.

I put this up as a draft PR for now because I anticipate that the incoming dataset might have additional data types that need to be tested.

brinkdp commented 2 months ago

The real data has now been made public. It comprises 16 new tracks for each of the two Linum spp. As an experiment, I've applied the findings from the initial commit of the PR to tests/configs/linum_popgen.

A main realization was that grep -v "^#" | had to be dropped from the filter script in order to preserve the header in the bgzipped file, so that the header's column name can be referenced in tracks.adapter.scoreColumn in config.json. As far as I can tell, omitting the grep operation should not affect header-less BEDs: testing with the BED in tests/fixtures/mito_krill was successful.
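To illustrate the idea of keeping the header while still sorting the records, here is a minimal sketch. It is not the repo's filter script: the function name `sort_bed_keep_header` and the column name `TD` in the example are hypothetical, and the input is assumed to be tab-delimited already.

```python
def sort_bed_keep_header(lines):
    """Sort BED-like records by (chrom, start, end), keeping '#' lines on top.

    Assumes tab-delimited input with integer start and end in columns 2 and 3.
    Files without a header pass through the same code path.
    """
    header = []
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        if line.startswith("#"):
            header.append(line)  # preserved verbatim, in original order
        else:
            records.append(line.split("\t"))
    # Chromosome names compare as plain strings, coordinates as integers.
    records.sort(key=lambda f: (f[0], int(f[1]), int(f[2])))
    return header + ["\t".join(f) for f in records]
```

For example, given a header line `#chrom`, `start`, `end`, `TD` followed by unsorted records, the header stays first and the records come out ordered by scaffold and position, ready for bgzip and tabix.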

The configs in f542970 are a hack to test that the new files work with the plugin. Since config.yml takes precedence over config.json upon running make, this experiment creates a redundant track named Placeholder_Lten_pop06_TD that only serves to handle downloading and indexing of the new data type via config.yml. The actual Manhattan plot track in the initial config is set to point to the resulting files (Lten_pop06_TD.bed.bgz and Lten_pop06_TD.bed.bgz.tbi) but uses a different trackId and name to avoid being overwritten by add-track. The final track has some quirks (see note below), but the outcome is nevertheless neat. I can recommend changing to "show all regions in assembly" in the view menu for a useful display.

With these findings, I would like to pass the baton to @kwentine to work out the details to best implement this in the back-end. Perhaps the schema for config.yml can be expanded to pass on "adapter": {"scoreColumn": "[HEADER_NAME]"}, "displays": [{"displayId": "[FILENAME]_display", "type": "LinearManhattanDisplay"}], and, for aesthetic flair, "category": ["[TITLE]"] to the final config.json?
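As a rough sketch of what the resulting track entry could look like, the helper below assembles one as a Python dict that can be dumped to JSON. The field layout follows my reading of the GWAS plugin's example config and should be checked against the plugin README. The function name `manhattan_track` and all argument values used with it are placeholders, not values from this PR's config.

```python
import json


def manhattan_track(track_id, name, assembly, bed_gz_uri, score_column, category):
    """Build a JBrowse 2 track entry for a Manhattan plot (illustrative layout).

    score_column must match a column title in the '#' header of the BED file.
    The tabix index is assumed to sit next to the data file with a .tbi suffix.
    """
    return {
        "type": "FeatureTrack",
        "trackId": track_id,
        "name": name,
        "assemblyNames": [assembly],
        "category": [category],
        "adapter": {
            "type": "BedTabixAdapter",
            "scoreColumn": score_column,
            "bedGzLocation": {"uri": bed_gz_uri},
            "index": {"location": {"uri": bed_gz_uri + ".tbi"}},
        },
        "displays": [
            {"displayId": f"{track_id}_display", "type": "LinearManhattanDisplay"}
        ],
    }


def to_json(track):
    """Serialize a track entry the way it would appear in config.json."""
    return json.dumps(track, indent=2)
```

Calling `manhattan_track` with, say, a hypothetical track ID, the URI of Lten_pop06_TD.bed.bgz and the name of the score column gives an entry that could be appended to the `tracks` list in config.json.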

Note: The organelle scaffolds, such as the first scaffold (ENA|CAMGYJ010000001|CAMGYJ010000001.1), lack annotations for the new track. However, even with defaultSession configured to start on the chromosomal scaffold ENA|CAMGYJ010000002|CAMGYJ010000002.1, the Manhattan plot track does not render the data. Manually selecting a scaffold (even re-initiating ENA|CAMGYJ010000002|CAMGYJ010000002.1) will make it render, though. At the moment, I do not know whether this bug comes from the GWAS plugin or from the initial config. It could very well be the latter, since I did not notice this behavior in the test fixture of the initial commit.

Commands:

./scripts/dockerserve --dev
./scripts/dockermake --test
cp -r tests/data/linum_popgen hugo/static/data

http://localhost:8080/genome-browser/?config=%2Fdata%2Flinum_popgen%2Fconfig.json

kwentine commented 1 month ago

@brinkdp First of all, I must say that I am impressed by your clever workarounds and your consistently clear summaries.

For complex configurations, such as the Manhattan plot, I think we should directly craft config.json, partly because the add-track CLI does not allow specifying adapter options such as scoreColumn. Other advanced track-level options could be passed to add-track as a JSON string, but if we are going to write JSON anyway, I reckon we might as well edit config.json directly.

That being said, the backend should still prepare the data for tracks manually added to config.json. So I propose that we add an optional skipAddTrack key to config.yaml that, when set to true, would instruct make build to generate track files as usual but skip the add-track call.
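A minimal sketch of the intended behaviour, assuming the tracks from config.yaml have already been parsed into a list of dicts. The dict layout and the function name `plan_track_steps` are hypothetical; only the skipAddTrack key comes from the proposal above.

```python
def plan_track_steps(tracks):
    """Split tracks into those to prepare and those to register with add-track.

    Every track gets its files generated (download, sort, compress, index).
    Tracks with skipAddTrack set to true are left out of the add-track call,
    because they are configured by hand in config.json.
    """
    to_prepare = []
    to_add = []
    for track in tracks:
        to_prepare.append(track["name"])
        if not track.get("skipAddTrack", False):  # the key is optional
            to_add.append(track["name"])
    return to_prepare, to_add
```

Defaulting the key to false keeps existing configs working unchanged, since tracks that do not mention skipAddTrack are still registered as before.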

What do you think?

brinkdp commented 1 month ago

You raise good points; a skipAddTrack key in combination with manual crafting of config.json would be an easy way to proceed with this. Please go ahead with that idea!

There is indeed an elevated complexity in this config. This got me thinking about how to best transfer this knowledge to new staff members in the future, should the need arise. Some documentation or perhaps even helper scripts would be needed. I will ruminate on this.

kwentine commented 1 month ago

Great, I will go ahead and implement the following in one or more separate branches:

skipAddTrack
header-preserving sorting of BED files
tabifying of BED files

When all these are reviewed and merged, I will help you merge back the changes in this branch.

Does that sound good? The alternative would be for me to implement everything in this very branch, but I fear it may mix too many concerns.

brinkdp commented 1 month ago

Great, sounds like a good plan!

kwentine commented 1 month ago

Work should be continued on the rebased branch gwas-test-continued (#53).

ScilifelabDataCentre / genome-portal

Test fixtures for GWAS plugin #43