Open rodichenko opened 3 years ago
BLAST search should allow the following types of search:
There are the following ways to start search:
VCF) and select "BLAST search" option from the context menu:
To search over nucleotide databases using blastn tool - the item "BLASTn Search" shall be selected.
To search over protein databases using blastp tool - the item "BLASTp Search" shall be selected.
Please note that these search settings can be changed later.
Some additional details:
In both cases, the BLAST panel should be opened. It is placed like the other additional panels, at the right side, and should have the following content, e.g.:
If the BLAST panel was opened by clicking a read/feature/gene, the corresponding sequence (according to the selection in the GUI) should appear in the Query sequence field and the corresponding search tool (blastn/blastp) should be selected.
If the read/feature/gene selected in the GUI for the search is too large to be inserted into the Query sequence field, the user shall receive a corresponding error message, e.g. "Selected feature is too large to be used as a query sequence." The BLAST panel should not be opened in this case. The maximum length of a sequence that can be inserted into the Query sequence field shall be defined in the System settings, which are stored on the server side and can be changed only by a system administrator. The system administrator shall be able to change the value of this setting "on-the-fly", without app redeployment.
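The length check described above could be sketched roughly as follows. This is only an illustration: the names `QueryTooLargeError` and `check_query_length` are hypothetical, and `max_length` is assumed to be read from the server-side System settings on every request, so that an administrator's change takes effect without redeployment.

```python
# Hypothetical sketch of the query length check; not the actual NGB code.
ERROR_MESSAGE = "Selected feature is too large to be used as a query sequence."


class QueryTooLargeError(ValueError):
    """Raised when the selected sequence exceeds the configured maximum."""


def check_query_length(sequence: str, max_length: int) -> str:
    """Return the sequence unchanged if it fits, otherwise raise.

    `max_length` is assumed to come from the server-side setting and to be
    re-read for each request rather than cached at startup.
    """
    if len(sequence) > max_length:
        # The caller is expected to show the message and NOT open the panel.
        raise QueryTooLargeError(ERROR_MESSAGE)
    return sequence
```

A caller would catch `QueryTooLargeError`, show its message to the user, and skip opening the BLAST panel.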
The BLAST panel should contain two sub-tabs:
The Search sub-tab should contain the following elements:
blastn - for the search over nucleotide databases using a nucleotide query. This tool shall be selected by default if the user clicked "BLASTn Search" for any read/feature/gene in the GUI
blastp - for the search over protein databases using a protein query. This tool shall be selected by default if the user clicked "BLASTp Search" for any read/feature/gene in the GUI
blastx - for the search over protein databases using a translated nucleotide query
tblastn - for the search over translated nucleotide databases using a protein query
tblastx - for the search over translated nucleotide databases using a translated nucleotide query
for blastn, tblastn, tblastx - nucleotide databases should be displayed, for blastp, blastx - protein databases should be displayed
for blastn tool, there should be:
for blastp tool, there should be:
for blastx tool, there should be:
for tblastn tool, there should be:
for tblastx tool, this dropdown list should be unavailable (invisible)
Max target sequences (CLI BLAST option max_target_seqs)
Expect threshold (CLI BLAST option evalue)
options in CLI style. If the user tries to specify such options in an incorrect format or with incorrect names, they should be ignored during the search
default values: Max target sequences - 100, Expect threshold - 0.05
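As a rough illustration of two rules above (which database type each tool shows, and silently ignoring malformed or unknown CLI-style options), a minimal sketch might look like this. The function names and the `ALLOWED_OPTIONS` whitelist are assumptions made for the example; the real set of accepted options is not specified in this issue.

```python
import shlex

# Tool -> database type, as stated in the list above.
NUCLEOTIDE_DB_TOOLS = {"blastn", "tblastn", "tblastx"}
PROTEIN_DB_TOOLS = {"blastp", "blastx"}

# Hypothetical whitelist of additional options (an assumption for this sketch).
ALLOWED_OPTIONS = {"-word_size", "-gapopen", "-gapextend", "-penalty", "-reward"}


def database_type(tool: str) -> str:
    """Return which kind of databases should be listed for the given tool."""
    if tool in NUCLEOTIDE_DB_TOOLS:
        return "NUCLEOTIDE"
    if tool in PROTEIN_DB_TOOLS:
        return "PROTEIN"
    raise ValueError(f"Unknown BLAST tool: {tool}")


def _is_option_name(token: str) -> bool:
    # "-word_size" is an option name; "-3" is a (negative) value.
    return len(token) > 1 and token[0] == "-" and token[1].isalpha()


def parse_additional_options(text: str) -> dict:
    """Parse '-name value' pairs, dropping anything malformed or unknown."""
    try:
        tokens = shlex.split(text)
    except ValueError:  # e.g. an unbalanced quote: ignore the whole input
        return {}
    options = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if not _is_option_name(token):
            i += 1  # stray value without an option name: ignored
            continue
        has_value = i + 1 < len(tokens) and not _is_option_name(tokens[i + 1])
        if has_value and token in ALLOWED_OPTIONS:
            options[token] = tokens[i + 1]
        i += 2 if has_value else 1
    return options
```

Dropping bad options instead of rejecting the whole request matches the "should be ignored during the search" wording; whether the user should also be warned is left open here.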
Some details:
CLI shall provide commands for adding and removing databases in the NGB Database list. Databases in the list should be split into two sub-lists: one for uploaded "custom" genomes and one for "purely" BLAST databases
NCBI taxonomy database
Since the BLAST search can take a long time, the results are not displayed immediately. For each search (after each click of the Search button), a new "search task" is created. The list of such tasks is displayed in the "History" sub-tab. This sub-tab should be opened automatically after the search starts and should look like:
Here a table shall be displayed that contains all of the user's BLAST search requests, where:
The table:
When a search is finished, its state changes to "Done" and the task ID becomes a hyperlink. The user can click this hyperlink to open the corresponding search results. Search results should be opened in the same tab ("History").
When the user clicks the hyperlink of a finished search, the corresponding results should be opened in the same tab, e.g.:
This form shall contain:
If the search was performed successfully but returned no results, a corresponding message should be displayed instead of the results table and the Download results button should be hidden, e.g.:
If the search finished with the "Failure" state, a corresponding message should be displayed instead of the results table and should contain short info about the error (if available); the Download results button should be hidden, e.g.:
This table contains search results. The table should contain not the raw BLAST search results but aggregated results grouped by sequence. The table shall contain the following columns:
The maximum number of rows in this table is defined by the Max target sequences parameter specified before the search (see above). The user should be able to sort this table by any column and manually configure the column order.
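The aggregation of raw hits into one row per sequence could be sketched as below. This is an assumption-laden illustration: the `Hit` type and its field names are hypothetical, and the aggregation rules (max score = best hit score, total score = sum of scores, E value = smallest value, matches = number of hits) follow the usual NCBI-style conventions rather than anything stated in this issue.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Hit:
    """One raw BLAST hit (hypothetical shape for this sketch)."""
    sequence_id: str
    score: float
    evalue: float


def aggregate_hits(hits, max_target_seqs):
    """Group raw hits by sequence and return at most `max_target_seqs` rows."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit.sequence_id].append(hit)

    rows = []
    for sequence_id, group in groups.items():
        rows.append({
            "sequenceId": sequence_id,
            "maxScore": max(h.score for h in group),
            "totalScore": sum(h.score for h in group),
            "evalue": min(h.evalue for h in group),
            "matches": len(group),
        })

    # Best sequences first (assumed default order), then cap the row count.
    rows.sort(key=lambda row: row["maxScore"], reverse=True)
    return rows[:max_target_seqs]
```

Query cover and percent identity are left out on purpose, since computing them needs alignment coordinates that this sketch does not model.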
When the user clicks a row (sequence) in the sequence table, a form with details about all matches (alignments) of the search query to that sequence shall be opened. The "Alignments" form should be opened in the same tab ("History"), e.g.:
This form shall contain:
Each match block should contain:
plus
or minus
The user should be able to view any found match (alignment) on a track (graphic visualization) in the Browser panel. To open the visualization form, the user should click the "View at track" link near the match in the "Alignments" form.
An alignment of the nucleotide sequence, opened by such a hyperlink, should look something like:
Example above is shown for the following match block:
The workflow should be the following - user clicks the hyperlink "View at track" and:
So, already opened additional tracks for the search query alignments will be automatically closed in any of these cases:
Details of the visualization:
An alignment of the protein sequence, opened by such a hyperlink, should look something like:
Example above is shown for the following match block:
Additional details of the protein queries visualization:
@rodichenko @DmitriiKrasnov I'd suggest returning alignments using the BTOP field
The “Blast trace-back operations” (BTOP) string describes the alignment produced by BLAST. This string is similar to the CIGAR string produced in SAM format, but there are important differences. BTOP is a more flexible format that lists not only the aligned region but also matches and mismatches. BTOP operations consist of 1.) a number with a count of matching letters, 2.) two letters showing a mismatch (e.g., “AG” means A was replaced by G), or 3.) a dash (“-”) and a letter showing a gap. The box below shows a blastn run first with BTOP output and then the same run with the BLAST report showing the alignments.
Prot:
qseq=LCGRGFIRA
sseq=VCTREYVRE
btop=LV1GT1GEFYIV1AE
Nucl
qseq=GGTAATGCCGGGAACTTTAGGGAGACTGCCGGTGATAAACCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGTCAGTACAGAGGGTTGCGAAGCCGCGAGGTGGAGCTAATCCCATAAAGCTGGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGCTGCACCAGAAGTAGATAGCTTAACCGCAAGAGGGCGATCACCACGGTGTGGTTCATGACTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGGGAAACTGCGGCTGG
sseq=GGTAATGCCGGGAACTTTAGGGAGACTGCCGGTGATAAACCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGAGTAGGGCTACACACGTGCTACAATGGTCAGTACAGAGGGTTGCGAAGCCGCGAGGTGGAGCTAATCCCATAAAGCTGGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGCTGCACCAGAAGTAGATAGCTTAACCTTCGGGAGGGCGTTTACCACGGTGTGGTTCATGACTGGGGTGAAGTCGTAACAAGGTAGCCCTAGGGGAACCTGCGGCTGG
btop=317GT-T1AGAG7AT1CT46GC8AC10
...
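To make the BTOP suggestion concrete, here is a minimal sketch of how a client could rebuild both aligned sequences from the ungapped query segment and the BTOP string. The function names are hypothetical, and it is assumed that in each two-character pair the first character belongs to the query and the second to the subject, with "-" marking a gap on that side.

```python
import re

# A BTOP token is either a run of digits (count of matching letters) or a
# two-character pair "<query char><subject char>"; "-" marks a gap.
_TOKEN = re.compile(r"(\d+)|([A-Za-z*-]{2})")


def parse_btop(btop):
    """Split a BTOP string into ("match", n) and ("pair", q, s) operations."""
    ops = []
    pos = 0
    for m in _TOKEN.finditer(btop):
        if m.start() != pos:
            raise ValueError(f"Unexpected characters at offset {pos}")
        pos = m.end()
        if m.group(1) is not None:
            ops.append(("match", int(m.group(1))))
        else:
            q, s = m.group(2)
            ops.append(("pair", q, s))
    if pos != len(btop):
        raise ValueError(f"Unexpected characters at offset {pos}")
    return ops


def apply_btop(query, btop):
    """Return (aligned_query, aligned_subject) for an ungapped query segment."""
    aligned_q, aligned_s = [], []
    i = 0  # position in the ungapped query
    for op in parse_btop(btop):
        if op[0] == "match":
            segment = query[i:i + op[1]]
            aligned_q.append(segment)
            aligned_s.append(segment)
            i += op[1]
        else:
            _, q, s = op
            aligned_q.append(q)
            aligned_s.append(s)
            if q != "-":  # a gap in the query consumes no query letter
                i += 1
    return "".join(aligned_q), "".join(aligned_s)
```

For the Prot example above, `apply_btop("LCGRGFIRA", "LV1GT1GEFYIV1AE")` yields the pair `("LCGRGFIRA", "VCTREYVRE")`, so returning `btop` would let the client derive `sseq` without it being sent separately.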
TBD
Note: a "genome view" cannot be built for all BLAST search results. It is possible only when the results contain sufficient details about the distribution of found hits across different (several) chromosomes of the same organism
If the Browse genome view button is clicked, the corresponding view should be opened:
Genome view should be built based on raw blast search results that contain all found hits (all matches of the search query to different sequences).
Here the distribution of found hits across different (several) chromosomes of the same organism shall be displayed. The form should contain:
@DmitriiKrasnov @rodichenko A new API method is added to get BLAST results: GET /restapi/task/{taskId}/group. It returns results grouped by sequence objects (results table view) along with alignments (alignments view).
Mapping for views:
Sequence ID - sequenceId
Organism - organism
TaxID - taxId
Max score - maxScore
Total score - totalScore
Query cover - queryCoverage
E value - evalue
Percent identity - percentIdentity
Matches - matches
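On the client side, the mapping above could be applied along these lines. This is a sketch only: it assumes each entry returned by the `group` method is a flat JSON object carrying the listed fields, which is not confirmed here, and `to_table_row` is a hypothetical helper name.

```python
# Column title -> field name, exactly as listed in the mapping above.
COLUMN_TO_FIELD = {
    "Sequence ID": "sequenceId",
    "Organism": "organism",
    "TaxID": "taxId",
    "Max score": "maxScore",
    "Total score": "totalScore",
    "Query cover": "queryCoverage",
    "E value": "evalue",
    "Percent identity": "percentIdentity",
    "Matches": "matches",
}


def to_table_row(entry: dict) -> dict:
    """Convert one grouped result entry into a row keyed by column title.

    Missing fields become None so the table can render an empty cell.
    """
    return {column: entry.get(field) for column, field in COLUMN_TO_FIELD.items()}
```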
Regarding NCBI references:
https://www.ncbi.nlm.nih.gov/nucleotide/{sequenceId}
https://www.ncbi.nlm.nih.gov/protein/{sequenceId}
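Building such a reference link could look like the sketch below. It assumes the nucleotide URL is used for hits from nucleotide databases and the protein URL for hits from protein databases; the function name and the `"NUCLEOTIDE"`/`"PROTEIN"` labels are hypothetical.

```python
from urllib.parse import quote

# URL templates taken from the two links above.
NCBI_URL_TEMPLATES = {
    "NUCLEOTIDE": "https://www.ncbi.nlm.nih.gov/nucleotide/{sequenceId}",
    "PROTEIN": "https://www.ncbi.nlm.nih.gov/protein/{sequenceId}",
}


def ncbi_reference_url(sequence_id: str, database_type: str) -> str:
    """Return the NCBI page URL for a sequence ID (escaped for use in a path)."""
    template = NCBI_URL_TEMPLATES[database_type]
    return template.format(sequenceId=quote(sequence_id, safe=""))
```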
@DmitriiKrasnov @rodichenko @NShaforostov
Default BLAST settings are optionally available using the restapi/defaultTrackSettings method under the blast_settings key:
"blast_settings": {
"query_max_lenth": 1024,
"max_target_seqs": 100,
"evalue": 0.001
}
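Since the key is only optionally present, a client would need fallbacks. A minimal sketch, assuming the method's response has already been decoded into a dictionary; the fallback values here simply repeat the example above and are placeholders, not confirmed defaults. The key `query_max_lenth` is kept spelled exactly as in the example.

```python
# Placeholder fallbacks copied from the example above (an assumption).
FALLBACK_BLAST_SETTINGS = {
    "query_max_lenth": 1024,  # spelled as in the example response
    "max_target_seqs": 100,
    "evalue": 0.001,
}


def blast_settings(default_track_settings: dict) -> dict:
    """Merge server-provided BLAST settings over the fallbacks.

    Works when the "blast_settings" key is absent, null, or only
    partially filled in.
    """
    configured = default_track_settings.get("blast_settings") or {}
    return {**FALLBACK_BLAST_SETTINGS, **configured}
```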
BLAST search
Let's introduce BLAST search feature support.
Displaying BLAST search results for read/feature sequence:
The user clicks on a read (or any other feature) and selects the BLAST search option:
The BLAST panel is displayed with the corresponding feature sequence and BLAST search results:
The user can navigate to each hit by clicking on it:
Displaying BLAST search results for any sequence:
User may also specify any sequence in the text input and click Search button to perform BLAST search.
Displaying whole genome view:
BLAST search results may be displayed as a Whole genome view:
This view contains per-chromosome results (highlighted by strong/weak or other criteria, TBD).
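One simple way to get per-chromosome numbers for such a view is to count raw hits by chromosome, as sketched below. The `chromosome` field on each hit is an assumption for illustration, and the strong/weak highlighting criteria are left out because they are still TBD.

```python
from collections import Counter


def hits_per_chromosome(hits):
    """Count raw BLAST hits per chromosome.

    Each hit is assumed to be a dict with a "chromosome" key (hypothetical).
    """
    return Counter(hit["chromosome"] for hit in hits)
```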