GUI: BLAST Search

rodichenko commented 3 years ago

BLAST search

Let's introduce BLAST search feature support.

Displaying BLAST search results for read/feature sequence:

User clicks on read (or any other feature) and selects BLAST search option:
BLAST panel is displayed with corresponding feature sequence and BLAST search results:
User can navigate to each hit by clicking on it:

Displaying BLAST search results for any sequence:

User may also specify any sequence in the text input and click Search button to perform BLAST search.

Displaying whole genome view:

BLAST search results may be displayed as a Whole genome view:

This view contains per-chromosome results (highlighted by strong/weak or other criteria, TBD).

NShaforostov commented 3 years ago

Details

Start BLAST search
The BLAST panel
- Search settings
- Search history
BLAST Search results

BLAST search should allow the following types of search:

search of the nucleotide sequence over BLAST databases
search of the protein sequence over BLAST databases

To start BLAST search

There are the following ways to start search:

to search an existing sequence used in the current dataset, user can click a read (or any other feature) on any track (except VCF) and select a "BLAST search" option from the context menu. To search over nucleotide databases using the blastn tool, the item "BLASTn Search" shall be selected. To search over protein databases using the blastp tool, the item "BLASTp Search" shall be selected. Please note that these search settings can be changed later. Some additional details:
- for BAM tracks, the "BLASTp Search" item shall be invisible in the context menu
- for well-structured gene tracks, if user clicks an exon, the "BLASTn Search" item shall contain two sub-items, "Exon only" and "All transcript info", e.g.:
or, to search a sequence of any type, user can open a new BLAST panel and manually specify all desired search settings

In both cases, the BLAST panel should be opened. It is placed like other additional panels, at the right side, and should have the following content, e.g.:

If the BLAST panel was opened by clicking a read/feature/gene, the corresponding sequence (according to the selection in the GUI) should appear in the Query sequence field and the corresponding search tool (blastn/blastp) should be selected.

If the read/feature/gene selected in the GUI for the search is too large to be inserted into the Query sequence field, user shall receive the corresponding error message, e.g. "Selected feature is too large to be used as a query sequence." The BLAST panel should not be opened in such a case. The maximum length of a sequence that can be inserted into the Query sequence field shall be defined in the System settings that are stored on the server side and can be changed only by a system administrator. The system administrator shall be able to change the value of this setting "on-the-fly", without app redeployment.
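The size check described above could look roughly like the following Python sketch. The function name, the exception type, and the way the limit is passed in are assumptions made for illustration; only the error message text comes from the description.

```python
class QueryTooLargeError(ValueError):
    """Raised when a selected feature cannot be used as a BLAST query."""


def prepare_query(sequence: str, max_length: int) -> str:
    """Return the cleaned sequence, or raise if it exceeds the limit.

    max_length is assumed to be read from the server-side system settings
    on every call, so an administrator's change applies without redeployment.
    """
    cleaned = "".join(sequence.split())  # drop whitespace and line breaks
    if len(cleaned) > max_length:
        raise QueryTooLargeError(
            "Selected feature is too large to be used as a query sequence."
        )
    return cleaned
```

In this sketch the caller would open the BLAST panel only when `prepare_query` returns without raising.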

The BLAST panel

The BLAST panel should contain two sub-tabs:

Search - to display and specify search settings, and also start a new search
History - to display the history of searches

Search settings

The Search sub-tab should contain the following elements:

selector of the TOOL that will be used for the search (mandatory). Only one tool can be selected:
- blastn - for the search over nucleotide databases using a nucleotide query. This tool shall be selected by default, if user clicked the "BLASTn Search" for any read/feature/gene on the GUI
- blastp - for the search over protein databases using a protein query. This tool shall be selected by default, if user clicked the "BLASTp Search" for any read/feature/gene on the GUI
- blastx - for the search over protein databases using a translated nucleotide query
- tblastn - for the search over translated nucleotide databases using a protein query
- tblastx - for the search over translated nucleotide databases using a translated nucleotide query
the QUERY SEQUENCE (mandatory section):
- here the existing sequence should be automatically inserted into the text field if the search was started in the first way described above, i.e. user selected an existing sequence in the browser (e.g. read/feature/gene/exon/transcript)
- or user can specify/edit the sequence manually
- or user can use a file with a sequence instead of the text field. For that:
- set the corresponding checkbox (see mockup) - in this case, the text field for a query should become disabled
- click the "Choose file" button and select the file with a sequence from the local workstation
- for an uploaded file, the file name should appear near the "Choose file" button
- user can click the "Choose file" button again and select another file if needed
- the common flow of the search sequence definition should be used:
- if the "upload" checkbox is disabled and the text field contains the sequence - this sequence shall be used for a search
- if the "upload" checkbox is disabled and the text field is empty - the search shall not be performed
- if the "upload" checkbox is enabled and the sequence file is chosen - this sequence file shall be used for a search
- if the "upload" checkbox is enabled and no sequence file is chosen - the search shall not be performed
text field for the search task title (optional field). In this field, user may specify a title for the current search operation to find it more easily later, e.g.:
search "set" (mandatory section). Here user can specify where the search should be performed:
- DATABASE dropdown list - for selecting the BLAST database (mandatory), e.g.: The displayed list of databases should correspond to one of the types - protein or nucleotide. This type is defined by the tool selected in the TOOL selector: for blastn, tblastn, tblastx - nucleotide databases should be displayed, for blastp, blastx - protein databases should be displayed
- ORGANISM dropdown list - for selecting/specifying the species for which the search should be performed (optional)
- checkbox to exclude certain organism(s) from the set of organisms for which the search is performed (unchecked by default)
- this dropdown list shall support multi-select. If several species are selected in the dropdown list and the Exclude checkbox is set - all selected organisms shall be excluded from the search
ALGORITHM dropdown list (mandatory) - list of algorithms available for the current tool:
- for the blastn tool, there should be:
- megablast (highly similar sequences) - default value
- discontiguous megablast (more dissimilar sequences)
- blastn (somewhat similar sequences)
- blastn-short (optimized for sequences shorter than 50 nucleotides)
- for the blastp tool, there should be:
- blastp (protein-protein BLAST) - default value
- blastp-short (optimized for query sequences shorter than 30 residues)
- blastp-fast (faster version that uses a larger word-size)
- for the blastx tool, there should be:
- blastx (search proteins using a translated nucleotide query) - default value
- blastx-fast (faster version that uses a larger word-size)
- for the tblastn tool, there should be:
- tblastn (search translated nucleotide databases using a protein query) - default value
- tblastn-fast (faster version that uses a larger word-size)
- for the tblastx tool, this dropdown list should be unavailable (invisible)
"Additional parameters" section - collapsible section to specify additional "technical" BLAST parameters that should be used for the search. Expanded view of that section: Where:
- Max target sequences (integer, greater than 0) field sets the maximum number of aligned sequences to display in the results (CLI BLAST option max_target_seqs)
- Expect threshold (real, greater than 0) field sets the expect value threshold for saving hits (CLI BLAST option evalue)
- Options field allows specifying additional BLAST options in CLI style. If user specifies such options in an incorrect format or with incorrect names - they should be ignored during the search
- Max target sequences and Expect threshold settings should have default values. These values shall be defined in the System settings that are stored on the server side and can be changed only by a system administrator. The system administrator shall be able to change the values of these settings "on-the-fly", without app redeployment. So, if user doesn't explicitly specify values for these parameters (e.g., doesn't even expand the "Additional parameters" section) - their default values shall be used for the search. As "base" default values, I suggest: Max target sequences - 100, Expect threshold - 0.05
Search button - to start a search
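The rules for choosing the query source (text field versus uploaded file) from the QUERY SEQUENCE item above reduce to a small decision function. A minimal Python sketch follows; the function name and the returned tuple shape are hypothetical, not part of the proposed API.

```python
from typing import Optional, Tuple


def resolve_query_source(upload_enabled: bool,
                         text: str,
                         file_name: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return ("file", name) or ("text", sequence); None means "do not search"."""
    if upload_enabled:
        # The text field is disabled in this mode, so its content is ignored.
        return ("file", file_name) if file_name else None
    sequence = text.strip()
    return ("text", sequence) if sequence else None
```

Under this sketch the Search button handler would start a search only when the result is not `None`.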

Some details:

The databases list shall be automatically loaded and synchronized with a certain inner NGB Database folder (set in the server settings). NGB CLI shall provide commands for adding databases to and removing them from the NGB Database list. Databases in the list should be split into two sub-lists - one for uploaded "custom" genomes and one for "purely" BLAST databases
The organisms list should be populated with species from the NCBI Taxonomy database
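The dependency between the selected tool, the database type, and the available algorithms described in the Search settings can be captured in one lookup table. The sketch below is an illustration only: `ToolConfig`, `TOOLS`, `visible_databases` and the `{"name", "type"}` shape of a database record are assumed names. The algorithm strings follow the BLAST+ `-task` values, where discontiguous megablast is spelled `dc-megablast`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ToolConfig:
    database_type: str               # "nucleotide" or "protein"
    algorithms: Tuple[str, ...]      # empty tuple: the dropdown is hidden
    default_algorithm: Optional[str]


TOOLS = {
    "blastn": ToolConfig("nucleotide",
                         ("megablast", "dc-megablast", "blastn", "blastn-short"),
                         "megablast"),
    "blastp": ToolConfig("protein",
                         ("blastp", "blastp-short", "blastp-fast"), "blastp"),
    "blastx": ToolConfig("protein", ("blastx", "blastx-fast"), "blastx"),
    "tblastn": ToolConfig("nucleotide", ("tblastn", "tblastn-fast"), "tblastn"),
    "tblastx": ToolConfig("nucleotide", (), None),
}


def visible_databases(tool: str, databases: list) -> list:
    """Keep only databases whose type matches the selected tool."""
    wanted = TOOLS[tool].database_type
    return [db for db in databases if db["type"] == wanted]
```

Keeping the mapping in data rather than in branching code means the TOOL selector, the DATABASE list and the ALGORITHM list can all be driven from the same table.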

Search history

Since a BLAST search can take a long time, the results are not displayed immediately: for each search (after each click of the Search button) a new "search task" is created. The list of such tasks is displayed in the "History" sub-tab. This sub-tab should be opened automatically after the search starts and should look like:

Here a table shall be displayed that contains all of the user's BLAST search requests, where:

Task ID - automatically created ID of the search task
Task title - title of the search task (if it was specified before the search)
Current state - status of the search task. I suppose, it could be:
- "Searching" - for a task being performed at the moment
- "Done" - for a task that finished successfully
- "Interrupted" - for a task canceled during the search
- "Failure" - for a failed task (task finished with errors)
Submitted at - date and time when the search was started
Duration - duration of the search task
block of the controls near each request:
- for a task that is currently being performed:
- button to cancel the search (conventionally shown by a cross-button). This button interrupts the search and changes the state of that task to "Interrupted"
- button to open the search again in the "Search" sub-tab (conventionally shown by a reverse arrow-button). This button should open the "Search" sub-tab and set the search parameters to the same values as in the current request
- for a task that has already been performed:
- button to open the search again in the "Search" sub-tab (conventionally shown by a reverse arrow-button). This button should open the "Search" sub-tab and set the search parameters to the same values as in the current request
Clear history button - to clear all BLAST search history of the user (remove all searches from the history). This button should also cancel all searches being performed at the moment and remove them from the history list as well.

The table:

should auto-refresh periodically, e.g. every 5 sec (only while the tab is open)
should be sorted by Submitted at column (from the newer tasks to older ones)
should have pagination and vertical scroll (only if this is required according to screen size)

When a search is finished, its state changes to "Done" and the task ID becomes a hyperlink. User can click this hyperlink to open the corresponding search results. Search results should be opened in the same tab ("History").
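The history table behaviour (states, per-row controls, ordering, paging) can be modelled as below. This is a sketch under assumptions: the class and function names and the control labels "cancel" and "reopen" are made up for illustration, and the state names are the ones proposed above.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import List, Optional


class TaskState(Enum):
    SEARCHING = "Searching"
    DONE = "Done"
    INTERRUPTED = "Interrupted"
    FAILURE = "Failure"


@dataclass
class SearchTask:
    task_id: int
    submitted_at: datetime
    state: TaskState
    title: Optional[str] = None


def row_controls(task: SearchTask) -> List[str]:
    """Buttons shown next to a task: running tasks can also be cancelled."""
    controls = ["reopen"]
    if task.state is TaskState.SEARCHING:
        controls.insert(0, "cancel")
    return controls


def has_results_link(task: SearchTask) -> bool:
    """The task ID becomes a hyperlink once the search is done."""
    return task.state is TaskState.DONE


def history_page(tasks: List[SearchTask], page: int, page_size: int) -> List[SearchTask]:
    """Sort newest first, then cut out one page (pages are numbered from 0)."""
    ordered = sorted(tasks, key=lambda t: t.submitted_at, reverse=True)
    return ordered[page * page_size:(page + 1) * page_size]
```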

Search results

When user clicks the hyperlink of a finished search task, the corresponding results should be opened in the same tab, e.g.:

This form shall contain:

Back button - to hide current results and return to the search history view
header with the opened search task title (if it was specified before the search). If the task title wasn't specified, nothing should be shown
Blast parameters collapsible section (collapsed by default). This section should contain details (parameters/options) of the opened search:
- Query info button - to open search query details:
- this button shall open a pop-up with the corresponding sequence and a label with its length, e.g.:
- if the sequence was uploaded from a local file - the sequence itself shall not be shown in the pop-up; the sequence file name shall be shown as a hyperlink instead - clicking it downloads the corresponding file:
- Used tool - name of the tool that was used for the search
- Submitted at - date and time when the search was started
- Database - name of the database that was used for the search
- Organisms - list of organisms for which the search was performed (if they were specified before the search)
- Algorithm - name of the algorithm that was used for the search
- "Additional parameters" section - collapsible section where user can view additional "technical" BLAST parameters that were used for the search (same that were configured before the search - see details above)
block of additional buttons at the right-upper corner of the form:
- Edit search button - to open the search again. This button should close the results pop-up, open the "Search" sub-tab of the "BLAST" panel and set the search parameters to the same values as in the current request
- if the sequence was uploaded from a local file in the "origin" search - it shall be displayed in the corresponding way, the sequence itself shall not be shown, e.g.:
- Download results button - to download full BLAST search results (raw) as CSV file to the local workstation
Sequences table - to show the search results summary grouped by sequences

If the search was performed successfully but there are no results, the corresponding message should be displayed instead of the results table and the Download results button should be hidden, e.g.:

If the search finished with the "Failure" state, the corresponding message should be displayed instead of the results table and should contain short info about the error (if it's available); the Download results button should be hidden, e.g.:

Sequences table

This table contains search results. The table should contain not the raw BLAST search results but aggregated results grouped by sequence. The table shall contain the following columns:

Sequence ID - IDs of the sequences in which hits were found. Each ID should be a hyperlink to a corresponding sequence page on NCBI (if exists)
Organism - organism specified in the sequence
TaxID - taxonomy ID specified in the sequence
Max score - the highest alignment score among all matches of the search query to that sequence
Total score - sum of alignment scores of all matches of the search query to that sequence
Query cover - the percent of the query length that is included in the aligned segments
E value - the number of alignments expected by chance with the calculated score or better. The table should be sorted by this column by default (ascending)
Percent identity - the highest percent identity for a set of aligned segments to the same subject sequence
Matches - number of matches of the search query to that sequence

The maximum number of rows in this table is defined by the Max target sequences parameter that is specified before the search (see above). User should be able to sort this table by any column and to manually configure the column order.
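To make the column definitions above concrete, here is a hedged Python sketch that aggregates raw hits into one row per sequence. The input field names (`sequence_id`, `score`, `evalue`, `percent_identity`, `query_start`, `query_end`) are assumptions about the raw result shape, and taking the lowest E value per sequence is also an assumption, since the description does not say how E values of several matches are combined.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_hits(hits: List[dict], query_length: int) -> List[dict]:
    """Group raw hits by sequence ID and compute the summary columns."""
    by_sequence: Dict[str, List[dict]] = defaultdict(list)
    for hit in hits:
        by_sequence[hit["sequence_id"]].append(hit)

    rows = []
    for sequence_id, group in by_sequence.items():
        # Query cover: merge overlapping query intervals (1-based, inclusive)
        # so that a query position covered twice is counted once.
        covered, last_end = 0, 0
        for start, end in sorted((h["query_start"], h["query_end"]) for h in group):
            start = max(start, last_end + 1)
            if end >= start:
                covered += end - start + 1
                last_end = end
        rows.append({
            "sequence_id": sequence_id,
            "max_score": max(h["score"] for h in group),
            "total_score": sum(h["score"] for h in group),
            "query_cover": 100.0 * covered / query_length,
            "evalue": min(h["evalue"] for h in group),
            "percent_identity": max(h["percent_identity"] for h in group),
            "matches": len(group),
        })
    # Default sorting: E value, ascending.
    return sorted(rows, key=lambda r: r["evalue"])
```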

Alignments info

When user clicks a row (sequence) in the sequences table, a form with details about all matches (alignments) of the search query to that sequence shall be opened. The "Alignments" form should be opened in the same tab ("History"), e.g.:

This form shall contain:

Back button - to hide "Alignments" form and return to the sequences table view
Sequence ID and its Definition (if it's specified for the sequence). ID should be a hyperlink to a corresponding sequence page on NCBI (if exists)
Details about all matches of the search query to the current sequence (match blocks)

Each match block should contain:

range (positions) of the current sequence where the match is defined
"View at track" link - hyperlink to view the match (alignment) to the current sequence (see details below). This hyperlink shall be visible only for those sequences whose reference files are in NGB databases
score and bit-score
expect value (E-value)
count of identities between sequences (by symbols) and its percent value
count of gaps (by symbols) and its percent value
only for protein sequences: count of positives and its percent value. Positives are conservative substitutions (substitutions that are often observed in related proteins)
block with the conventional figure of the query string alignment to the current sequence segment:
- start and end position of the query string segment
- aligned query string segment
- start and end position of the current sequence segment
- the current sequence segment to which the query string segment was aligned
- only for nucleotide sequences: strands of each sequence (query and subject) - plus or minus
- only for nucleotide sequences: symbols that "link" the corresponding letters in both sequences:
- straight line if letters are equal
- nothing (empty) if letters are not equal (mismatch)
- minus symbol ("-") in either sequence - for gaps. Example:
- only for protein sequences: "linking" sequence between the corresponding letters in both sequences:
- the amino acid symbol (letter) itself, if this amino acid is the same for both lines (query and sequence)
- nothing (empty) if amino acids are not equal (mismatch)
- minus symbol ("-") in either sequence - for gaps. Example:
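The "linking" line between the two aligned rows follows directly from the rules above. A minimal Python sketch, assuming both rows are already gap-padded to the same length; the function name `midline` is hypothetical:

```python
def midline(query: str, subject: str, protein: bool = False) -> str:
    """Build the "linking" line shown between two aligned, equal-length rows."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    out = []
    for q, s in zip(query, subject):
        if q == s and q != "-":
            out.append(q if protein else "|")  # identical letters
        else:
            out.append(" ")                    # mismatch or gap
    return "".join(out)
```

For nucleotide rows an identical pair yields a straight line, for protein rows it yields the amino acid letter itself, and mismatches and gaps leave the position empty in both cases.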

Alignment on track

User should have the ability to view any found match (alignment) at a track (graphic visualization) in the Browser panel. To open the visualization form, user should click the "View at track" link near the match in the "Alignments" form.

An alignment of the nucleotide sequence, opened by such a hyperlink, should look something like:

Example above is shown for the following match block:

The workflow should be the following - user clicks the hyperlink "View at track" and:

if the reference/chromosome currently opened in the Browser panel differs from the reference/chromosome to which the sequence from the BLAST results belongs:
- the currently opened reference/chromosome and all related tracks shall be closed
- in the Browser panel, the reference/chromosome to which the sequence from the BLAST results belongs shall be opened:
- if for that reference, there are gene annotation files in NGB, their tracks shall be automatically opened too
- in the Browser panel, an additional track for the search query shall appear
if the reference and chromosome currently opened in the Browser panel are the same reference and chromosome to which the sequence from the BLAST results belongs:
- in the Browser panel, only an additional track for the search query shall appear
if user clicks the hyperlink "View at track" for another alignment on the same reference that is already opened - one more additional track for the search query alignment should be opened (i.e. a separate track for each clicked aligned range of the same subject reference)
all other behavior of the Browser panel should stay as currently implemented

So, already opened additional tracks for the search query alignments will be automatically closed in any of these cases:

user clicks the hyperlink "View at track" for an alignment related to a reference distinct from the currently opened
user selects another chromosome for displaying in the Browser panel
user selects in the "Datasets" panel a dataset with a reference distinct from the currently opened
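The "View at track" workflow and the closing rules above can be summarized as a list of Browser panel steps. The sketch below is illustrative only: the function name, the `(reference, chromosome)` tuple and the action strings are assumptions and do not correspond to existing NGB code.

```python
from typing import List, Optional, Tuple

Location = Tuple[str, str]  # (reference name, chromosome name)


def view_at_track_actions(opened: Optional[Location], target: Location) -> List[str]:
    """List the Browser panel steps for one "View at track" click."""
    actions = []
    if opened != target:
        if opened is not None:
            # This also closes query tracks added by earlier clicks.
            actions.append("close current reference and its tracks")
        actions.append("open %s:%s" % target)
        actions.append("open gene annotation tracks, if any")
    # A separate query track is added for every clicked alignment.
    actions.append("add query alignment track")
    return actions
```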

Details of the visualization:

at the reference track, by default, the strand matching the strand of the subject sequence of the alignment should be selected
for the search query - track that shows alignment of the search query to the sequence:
- on a track, only search query shall be shown
- track header shall contain the type of the query ("Nucleotide query" or "Protein query")
- track shall be displayed in a manner similar to a single read at the ALIGNMENT track in the "Browser", in which:
- query matches - should be shown as a gray line spanning the width of the aligned matches
- query strand - should be shown as an arrow on the edge of the query line
- mismatches - should be shown as separate colored rectangles with the corresponding letters
- gaps in sequence - should be shown as insertions (a perpendicular violet line in the gap position)
- gaps in query - should be shown as deletions (a black line linking two "separate" parts of the query)
- additionally, near each end of the query line at the track, the counts of the query positions that were not aligned to the sequence should be shown. In the example above, the first 3 symbols and the last 2 symbols of the query were not aligned
user shall have the ability to move query track among other tracks
by default, the reference scale shall be set to show the full query line in maximum possible zoom
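The counts of unaligned query positions shown at the ends of the query line can be derived from the aligned query range. A small Python sketch, assuming 1-based inclusive query coordinates as in BLAST reports; the function and parameter names are hypothetical:

```python
from typing import Tuple


def unaligned_ends(query_start: int, query_end: int, query_length: int) -> Tuple[int, int]:
    """Count query symbols left unaligned before and after the match."""
    if not 1 <= query_start <= query_end <= query_length:
        raise ValueError("alignment range must lie within the query")
    return query_start - 1, query_length - query_end
```

For a 20-symbol query aligned over positions 4 to 18 this gives 3 and 2, which matches the "first 3 symbols and last 2 symbols" case mentioned in the list.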

An alignment of the protein sequence, opened by such a hyperlink, should look something like:

Example above is shown for the following match block:

Additional details of the protein queries visualization:

at the reference track, by default, the strand and translation matching the strand of the subject sequence of the alignment should be selected
for the protein query track:
- query matches, strand, gaps should be shown similarly to a nucleotide query, but all shown parts (elements) should have lengths proportional to the amino acid rectangle
- mismatches - should be shown as separate colored rectangles with the corresponding letters. For all mismatches, a single highlighting color should be used

mzueva commented 3 years ago

@rodichenko @DmitriiKrasnov I'd suggest returning alignments using the BTOP field

The “Blast trace-back operations” (BTOP) string describes the alignment produced by BLAST. This string is similar to the CIGAR string produced in SAM format, but there are important differences. BTOP is a more flexible format that lists not only the aligned region but also matches and mismatches. BTOP operations consist of 1.) a number with a count of matching letters, 2.) two letters showing a mismatch (e.g., “AG” means A was replaced by G), or 3.) a dash (“-”) and a letter showing a gap. The box below shows a blastn run first with BTOP output and then the same run with the BLAST report showing the alignments.

Prot:
qseq=LCGRGFIRA
sseq=VCTREYVRE
btop=LV1GT1GEFYIV1AE

Nucl:
qseq=GGTAATGCCGGGAACTTTAGGGAGACTGCCGGTGATAAACCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGTCAGTACAGAGGGTTGCGAAGCCGCGAGGTGGAGCTAATCCCATAAAGCTGGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGCTGCACCAGAAGTAGATAGCTTAACCGCAAGAGGGCGATCACCACGGTGTGGTTCATGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGAAACTGCGGCTGG
sseq=GGTAATGCCGGGAACTTTAGGGAGACTGCCGGTGATAAACCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGTCAGTACAGAGGGTTGCGAAGCCGCGAGGTGGAGCTAATCCCATAAAGCTGGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGCTGCACCAGAAGTAGATAGCTTAACCTTCGGGAGGGCGTTTACCACGGTGTGGTTCATGACTGGGGTGAAGTCGTAACAAGGTAGCCCTAGGGGAACCTGCGGCTGG
btop=317GT-T1AGAG7AT1CT46GC8AC10
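A BTOP string like the ones above can be decoded with a short tokenizer. The following Python sketch is one possible client-side approach, not the server's implementation; `parse_btop` and `subject_from_btop` are hypothetical names, and `subject_from_btop` assumes the query row is already gap-padded (contains "-" wherever BTOP reports a gap in the query).

```python
import re
from typing import List, Tuple, Union

BtopOp = Union[int, Tuple[str, str]]  # match count, or (query, subject) pair


def parse_btop(btop: str) -> List[BtopOp]:
    """Split a BTOP string into match counts and two-letter operations."""
    ops: List[BtopOp] = []
    pos = 0
    for token in re.finditer(r"\d+|[A-Za-z*-]{2}", btop):
        if token.start() != pos:
            raise ValueError("unexpected character at offset %d" % pos)
        text = token.group()
        ops.append(int(text) if text.isdigit() else (text[0], text[1]))
        pos = token.end()
    if pos != len(btop):
        raise ValueError("unexpected character at offset %d" % pos)
    return ops


def subject_from_btop(qseq: str, btop: str) -> str:
    """Rebuild the aligned subject row from the aligned query row and BTOP."""
    subject, i = [], 0
    for op in parse_btop(btop):
        if isinstance(op, int):
            subject.append(qseq[i:i + op])  # matches: copy from the query
            i += op
        else:
            subject.append(op[1])           # mismatch or gap: subject letter
            i += 1
    return "".join(subject)
```

Applied to the protein example, `subject_from_btop("LCGRGFIRA", "LV1GT1GEFYIV1AE")` reproduces the posted sseq, so only the query row and the BTOP string are needed to draw the alignment.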

NShaforostov commented 3 years ago

...

Browse genome view button - to open the form with the search results on a whole genome. This button should be enabled only in cases when in results there are sufficient details about distribution of found hits to different (several) chromosomes of the same organism ...

Genome view

TBD

Note: a "genome view" cannot be built for all BLAST search results. It is possible only when the results contain sufficient details about the distribution of found hits across several chromosomes of the same organism

If the Browse genome view button is clicked, the corresponding view should be opened:

Genome view should be built based on raw blast search results that contain all found hits (all matches of the search query to different sequences).

Here the distribution of found hits to different (several) chromosomes of the same organism shall be displayed. The form should contain:

Species dropdown list - if there are several organisms in results, for which the genome view can be built - by this dropdown, user can select the desired organism
chromosome length axis
conditional image of the distribution of hits to each of the chromosomes (only for found chromosomes). Each chromosome shall contain:
- name (according to the chromosome name from the search results)
- bar charts of the hit scores distributed along the chromosome according to the ranges of the found hits at the sequence segments. Hit bar charts should be placed along the chromosome according to their positions, which can be obtained from the raw results. Note that for any chromosome there can be several sequences in the search results
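Since the Genome view is still TBD, the following Python sketch only illustrates one way the raw hits could be grouped for it. The hit field names (`organism`, `chromosome`, `start`) and the "at least two chromosomes" threshold are assumptions; the description only says hits must be distributed over several chromosomes of the same organism.

```python
from collections import defaultdict
from typing import Dict, List


def genome_view_data(hits: List[dict]) -> Dict[str, Dict[str, List[dict]]]:
    """Group hits as organism -> chromosome -> hits sorted by start position.

    Only organisms with hits on at least two chromosomes are kept; hits
    without chromosome information are skipped.
    """
    grouped: Dict[str, Dict[str, List[dict]]] = defaultdict(lambda: defaultdict(list))
    for hit in hits:
        if hit.get("chromosome"):
            grouped[hit["organism"]][hit["chromosome"]].append(hit)
    view = {}
    for organism, chromosomes in grouped.items():
        if len(chromosomes) >= 2:
            view[organism] = {
                name: sorted(items, key=lambda h: h["start"])
                for name, items in chromosomes.items()
            }
    return view
```

With this shape, the Browse genome view button could be enabled when the returned dictionary is non-empty, and its keys could fill the Species dropdown.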

mzueva commented 3 years ago

@DmitriiKrasnov @rodichenko A new API method is added to get BLAST results: GET /restapi/task/{taskId}/group. It returns results grouped by sequence objects (results table view) along with alignments (alignments view).

Mapping for views:

Results table

Sequence ID - sequenceId
Organism - organism
TaxID - taxId
Max score - maxScore
Total score - totalScore
Query cover - queryCoverage
E value - evalue
Percent identity - percentIdentity
Matches - matches

Alignments view

Regarding NCBI references:

nucleotide sequences https://www.ncbi.nlm.nih.gov/nucleotide/{sequenceId}
protein sequences https://www.ncbi.nlm.nih.gov/protein/{sequenceId}
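Building the Sequence ID hyperlink from the two URL templates above might look like this. The function name and the `database_type` argument are assumptions; the sequence ID is percent-encoded defensively in case it contains characters such as "|".

```python
from urllib.parse import quote


def ncbi_link(sequence_id: str, database_type: str) -> str:
    """Return the NCBI page URL for a hit, based on the database type."""
    sections = {"nucleotide": "nucleotide", "protein": "protein"}
    if database_type not in sections:
        raise ValueError("unknown database type: %r" % database_type)
    return "https://www.ncbi.nlm.nih.gov/%s/%s" % (
        sections[database_type], quote(sequence_id, safe=""))
```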

mzueva commented 3 years ago

@DmitriiKrasnov @rodichenko @NShaforostov Default BLAST settings are optionally available using the restapi/defaultTrackSettings method under the blast_settings key:

 "blast_settings": {
      "query_max_lenth": 1024,
      "max_target_seqs": 100,
      "evalue": 0.001
    }
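On the client side, these defaults could be merged with whatever the user typed into the "Additional parameters" section. A minimal sketch in Python, assuming the settings payload has exactly the shape shown above (key names kept verbatim); `effective_parameters` is a hypothetical helper, not an existing NGB function.

```python
import json

# Sample payload copied from the comment above (key names kept verbatim).
SETTINGS_JSON = (
    '{"blast_settings": {"query_max_lenth": 1024, '
    '"max_target_seqs": 100, "evalue": 0.001}}'
)


def effective_parameters(user_params: dict, settings: dict) -> dict:
    """Fill in parameters the user left unset from the server defaults."""
    defaults = settings.get("blast_settings", {})
    params = {}
    for key in ("max_target_seqs", "evalue"):
        value = user_params.get(key)
        if value is None:          # not specified by the user: use the default
            value = defaults.get(key)
        if value is None or value <= 0:
            raise ValueError("%s must be set and greater than 0" % key)
        params[key] = value
    return params


settings = json.loads(SETTINGS_JSON)
```

A call with an empty dictionary returns the server defaults unchanged, while a user-supplied value overrides only its own key.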

epam / NGB

GUI: BLAST Search #381
