epam / NGB

New Genome Browser (NGB) - a Web-based NGS data viewer with unique Structural Variations (SVs) visualization capabilities, high performance, scalability, and cloud data support
MIT License

GUI: BLAST Search #381

Open rodichenko opened 3 years ago

rodichenko commented 3 years ago

BLAST search

Let's introduce BLAST search feature support.

Displaying BLAST search results for read/feature sequence:

Displaying BLAST search results for any sequence:

The user may also enter any sequence in the text input and click the Search button to perform a BLAST search.

Displaying whole genome view:

BLAST search results may be displayed as a Whole genome view:

whole-genome-view

This view contains per-chromosome results (highlighted by strong/weak or other criteria, TBD).

NShaforostov commented 3 years ago

Details

BLAST search should allow the following types of search:

There are the following ways to start search:

In both cases, the BLAST panel should be opened. Like other additional panels, it is placed at the right side and should have the following content, e.g.: image

If the BLAST panel was opened by clicking a read/feature/gene, the corresponding sequence (according to the selection in the GUI) should appear in the Query sequence field and the corresponding search tool (blastn/blastp) should be selected.

If the read/feature/gene selected in the GUI for the search is too large to be inserted into the Query sequence field, the user shall receive a corresponding error message, e.g. "Selected feature is too large to be used as a query sequence." The BLAST panel should not be opened in that case. The maximum length of a sequence that can be inserted into the Query sequence field shall be defined in System settings that are stored on the server side and can be changed only by the system administrator. The system administrator shall be able to change the value of this setting "on-the-fly", without redeploying the app.

The BLAST panel

The BLAST panel should contain two sub-tabs:

Search settings

In the Search sub-tab, the following elements should be present:

Some details:

  1. The Databases list shall be automatically loaded from and synchronized with a dedicated inner NGB Database folder (set in server settings). NGB CLI shall provide commands for adding databases to and removing them from the NGB Database list. Databases should be split into two sub-lists: one for uploaded "custom" genomes and one for "pure" BLAST databases.
  2. The Organisms list should be fetched from the species in the NCBI taxonomy database.

Search history

Since a BLAST search can take a long time, the results are not displayed immediately. For each search (after each click of the Search button), a new "search task" is created. The list of such tasks is displayed on the "History" sub-tab. This sub-tab should be opened automatically after the search starts and should look like: image

A table containing all of the user's BLAST search requests shall be displayed here, where:

The table:

When a search is finished, its state changes to "Done" and the task ID becomes a hyperlink. The user can click the hyperlink to open the corresponding search results. Search results should be opened in the same tab ("History").

Search results

When the user clicks the hyperlink of a finished search, the corresponding results should be opened in the same tab, e.g.: image

This form shall contain:

If the search was performed successfully but there are no results, a corresponding message should be displayed instead of the results table and the Download results button should be hidden, e.g.: image

If the search finished with the "Failure" state, a corresponding message containing short info about the error (if available) should be displayed instead of the results table, and the Download results button should be hidden, e.g.: image
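
The three outcomes described above (results present, no results, failure) can be sketched as a small decision function. This is only an illustration: the `ResultsView` type, the function name, and the message texts are hypothetical, and only the "Done" and "Failure" state names come from this spec.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ResultsView:
    """What the results form shows (hypothetical structure)."""
    show_table: bool
    show_download_button: bool
    message: Optional[str]


def build_results_view(state: str, result_count: int,
                       error_info: Optional[str] = None) -> ResultsView:
    """Decide what to display for a finished search task."""
    if state == "Failure":
        # Message replaces the table; error details are appended if available.
        message = "Search failed"  # placeholder wording, not from the spec
        if error_info:
            message += f": {error_info}"
        return ResultsView(False, False, message)
    if state == "Done":
        if result_count == 0:
            # Successful search without hits: message only, no download.
            return ResultsView(False, False, "No results found")  # placeholder wording
        return ResultsView(True, True, None)
    # Results are only opened for finished tasks.
    raise ValueError(f"results are not available in state {state!r}")
```

In both message-only cases the Download results button is hidden, matching the two paragraphs above.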

Sequences table

This table contains the search results. It should contain not the raw BLAST search results but aggregated results grouped by sequence. The table shall contain the following columns:

The maximum number of rows in this table is defined by the Max target sequences parameter that is specified before the search (see above). The user should be able to sort this table by any column and manually configure the column order.
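
A minimal sketch of the aggregation described above, assuming each raw hit is a dict with hypothetical `sequence_id`, `score`, and `evalue` keys. How NGB actually computes the aggregated values is not stated in this issue; taking the maximum and the sum of hit scores and the lowest E value is an assumption made for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def group_hits(hits: List[dict], max_target_seqs: int) -> List[dict]:
    """Aggregate raw hits into one row per sequence (illustrative only)."""
    by_sequence: Dict[str, List[dict]] = defaultdict(list)
    for hit in hits:
        by_sequence[hit["sequence_id"]].append(hit)

    rows = []
    for sequence_id, seq_hits in by_sequence.items():
        rows.append({
            "sequenceId": sequence_id,
            "maxScore": max(h["score"] for h in seq_hits),
            "totalScore": sum(h["score"] for h in seq_hits),
            "evalue": min(h["evalue"] for h in seq_hits),  # assumed: best hit
            "matches": len(seq_hits),
        })
    # Best sequences first, then cap the row count.
    rows.sort(key=lambda row: row["maxScore"], reverse=True)
    return rows[:max_target_seqs]


def sort_rows(rows: List[dict], column: str, descending: bool = False) -> List[dict]:
    """Return the rows sorted by any single column."""
    return sorted(rows, key=lambda row: row[column], reverse=descending)
```

Capping after sorting keeps the best-scoring sequences when more sequences are found than Max target sequences allows.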

Alignments info

When the user clicks a row (sequence) in the Sequences table, a form with details about all matches (alignments) of the search query to that sequence shall be opened. The "Alignments" form should be opened in the same tab ("History"), e.g.: image

This form shall contain:

Each match block should contain:

Alignment on track

The user should be able to view any found match (alignment) on a track (graphic visualization) in the Browser panel. To open the visualization, the user should click the "View at track" link near the match in the "Alignments" form.

An alignment of the nucleotide sequence, opened by such a hyperlink, should look something like: image

Example above is shown for the following match block: image

The workflow should be the following: the user clicks the "View at track" hyperlink and:

So, additional tracks already opened for the search query alignments will be closed automatically in any of these cases:

Details of the visualization:

An alignment of the protein sequence, opened by such a hyperlink, should look something like: image

Example above is shown for the following match block:

image

Additional details of the protein queries visualization:

mzueva commented 3 years ago

@rodichenko @DmitriiKrasnov I'd suggest returning alignments using the BTOP field

The "Blast trace-back operations" (BTOP) string describes the alignment produced by BLAST. This string is similar to the CIGAR string produced in SAM format, but there are important differences. BTOP is a more flexible format that lists not only the aligned region but also matches and mismatches. BTOP operations consist of 1) a number with a count of matching letters, 2) two letters showing a mismatch (e.g., "AG" means A was replaced by G), or 3) a dash ("-") and a letter showing a gap. The box below shows a blastn run first with BTOP output and then the same run with the BLAST report showing the alignments.

Prot:
qseq=LCGRGFIRA
sseq=VCTREYVRE
btop=LV1GT1GEFYIV1AE

Nucl:
qseq=GGTAATGCCGGGAACTTTAGGGAGACTGCCGGTGATAAACCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGTCAGTACAGAGGGTTGCGAAGCCGCGAGGTGGAGCTAATCCCATAAAGCTGGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGCTGCACCAGAAGTAGATAGCTTAACCGCAAGAGGGCGATCACCACGGTGTGGTTCATGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGAAACTGCGGCTGG
sseq=GGTAATGCCGGGAACTTTAGGGAGACTGCCGGTGATAAACCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGTCAGTACAGAGGGTTGCGAAGCCGCGAGGTGGAGCTAATCCCATAAAGCTGGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGCTGCACCAGAAGTAGATAGCTTAACCTTCGGGAGGGCGTTTACCACGGTGTGGTTCATGACTGGGGTGAAGTCGTAACAAGGTAGCCCTAGGGGAACCTGCGGCTGG
btop=317GT-T1AGAG7AT1CT46GC8AC10
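
To illustrate how a client could rebuild an alignment from the BTOP field, here is a minimal decoding sketch. It assumes the first letter of each two-character operation belongs to the query and the second to the subject (so "-T" is a gap in the query), and that the query is passed without gap characters. The function name is hypothetical.

```python
import re
from typing import Tuple


def expand_btop(btop: str, query: str) -> Tuple[str, str]:
    """Rebuild the gapped query and subject strings from a BTOP string."""
    # A valid string is a sequence of numbers and two-character operations.
    if not re.fullmatch(r"(?:\d+|\D{2})*", btop):
        raise ValueError(f"malformed BTOP string: {btop!r}")

    aligned_query, aligned_subject = [], []
    pos = 0  # position in the ungapped query
    for token in re.findall(r"\d+|\D{2}", btop):
        if token.isdigit():
            # Run of identical letters: copy it from the query to both sides.
            count = int(token)
            run = query[pos:pos + count]
            if len(run) != count:
                raise ValueError("BTOP is longer than the query")
            aligned_query.append(run)
            aligned_subject.append(run)
            pos += count
        else:
            q_char, s_char = token
            if q_char != "-":
                # Mismatch or subject gap: one query letter is consumed.
                if pos >= len(query) or query[pos] != q_char:
                    raise ValueError(f"query does not match BTOP at {pos}")
                pos += 1
            aligned_query.append(q_char)
            aligned_subject.append(s_char)
    if pos != len(query):
        raise ValueError("BTOP is shorter than the query")
    return "".join(aligned_query), "".join(aligned_subject)


# The protein example above decodes back to its qseq/sseq pair.
print(expand_btop("LV1GT1GEFYIV1AE", "LCGRGFIRA"))
```

Since the subject side is fully recoverable from the query plus BTOP, the GUI would only need the query sequence and this one field to draw matches, mismatches, and gaps.
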

NShaforostov commented 3 years ago

...

Genome view

TBD

Note: a "genome view" cannot be built for all BLAST search results. It is possible only when the results contain sufficient details about the distribution of found hits across several chromosomes of the same organism.

If the Browse genome view button is clicked, the corresponding view should be opened: image

Genome view should be built based on raw blast search results that contain all found hits (all matches of the search query to different sequences).

Here the distribution of found hits across several chromosomes of the same organism shall be displayed. The form should contain:

mzueva commented 3 years ago

@DmitriiKrasnov @rodichenko A new API method is added to get BLAST results: GET /restapi/task/{taskId}/group. It returns results grouped by sequence objects (results table view) along with alignments (alignments view).

Mapping for views:

Results table

Sequence ID - sequenceId
Organism - organism
TaxID - taxId
Max score - maxScore
Total score - totalScore
Query cover - queryCoverage
E value - evalue
Percent identity - percentIdentity
Matches - matches
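
On the client side, the mapping above could be applied roughly as follows. This is a sketch only: the shape of the response envelope is not described in this thread, so the code assumes each result entry is a flat dict carrying the listed field names, and `base_url` and both function names are hypothetical.

```python
from typing import Any, Dict

# Results table column title -> field name in the API response.
COLUMN_TO_FIELD = {
    "Sequence ID": "sequenceId",
    "Organism": "organism",
    "TaxID": "taxId",
    "Max score": "maxScore",
    "Total score": "totalScore",
    "Query cover": "queryCoverage",
    "E value": "evalue",
    "Percent identity": "percentIdentity",
    "Matches": "matches",
}


def results_url(base_url: str, task_id: int) -> str:
    """Build the URL of the grouped-results method for a task."""
    return f"{base_url.rstrip('/')}/restapi/task/{task_id}/group"


def to_table_row(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Convert one result entry into a row keyed by column title.

    Missing fields become None instead of raising an error.
    """
    return {column: entry.get(field) for column, field in COLUMN_TO_FIELD.items()}
```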

Alignments view

image

Regarding NCBI references:

mzueva commented 3 years ago

@DmitriiKrasnov @rodichenko @NShaforostov Default BLAST settings are optionally available via the restapi/defaultTrackSettings method under the blast_settings key:

    "blast_settings": {
        "query_max_lenth": 1024,
        "max_target_seqs": 100,
        "evalue": 0.001
    }
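
A rough sketch of how a client might consume these defaults. It assumes the fragment arrives wrapped in a JSON object and that `query_max_lenth` (spelled as in the fragment above) is the limit used for the "too large" check from the Details comment. The function names are hypothetical.

```python
import json
from typing import Optional

# Example payload: the fragment above wrapped in a JSON object (assumed shape).
PAYLOAD = """
{"blast_settings": {"query_max_lenth": 1024, "max_target_seqs": 100, "evalue": 0.001}}
"""

TOO_LARGE = "Selected feature is too large to be used as a query sequence."


def load_blast_settings(payload: str) -> dict:
    """Extract the optional blast_settings section; empty dict if absent."""
    return json.loads(payload).get("blast_settings", {})


def check_query(sequence: str, settings: dict) -> Optional[str]:
    """Return an error message if the query exceeds the configured limit."""
    max_length = settings.get("query_max_lenth")  # key as spelled in the source
    if max_length is not None and len(sequence) > max_length:
        return TOO_LARGE
    return None


def build_search_params(settings: dict, **overrides) -> dict:
    """Start from server defaults and apply values entered by the user."""
    params = {key: settings[key]
              for key in ("max_target_seqs", "evalue") if key in settings}
    params.update(overrides)
    return params


settings = load_blast_settings(PAYLOAD)
print(check_query("ACGT" * 300, settings))        # 1200 letters, over the limit
print(build_search_params(settings, evalue=0.05))
```

Because the settings are optional, every lookup tolerates a missing key rather than assuming the section is present.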