Closed by laceysanderson 6 years ago.
This will be an administrator form, to differentiate it from raw phenotypes. The upload page should live at admin/tripal/extensions/analyzed-phenotypes/upload. The page above it should simply be a listing of links, like admin/tripal. Theming should be minimal in order to match the administrative theme of the site.
The upload process will be a multi-step form:
The uploaded file (tab-delimited) will have the following columns:
This proof of concept shows a 3-stage data loader for analyzed phenotypes. The design goal is to pattern the interface after the visual appearance of Tripal admin pages. The overall layout of the page shows the stage indicator (arrows pointing right/forward), followed by an autocomplete textfield form element and an area for validation results or drag-and-drop.
Below is a note about distributing the validation process between (1) the upload process and (2) a Tripal job. Validation as part of the upload process handles minor or basic checks, such as ensuring a project was selected, the file has the right number of columns, and the columns match the required headers. In the second process, once the file passes basic validation, the module passes the extensive row-and-column validation of the data to the server as a Tripal job request. This method allows the module to manage server resources more efficiently.
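The quick on-upload half of that split could look something like the sketch below. The function name and the expected header columns are illustrative assumptions, not the module's actual API; only cheap checks run here, and the deep row-by-row validation is deferred to the Tripal job.

```php
<?php
// Sketch of the fast, on-upload validation step (hypothetical helper name).
// Returns an array of human-readable error strings; empty means "passed".
function ap_basic_validation($project_id, array $header) {
  // The expected column headers here are an assumption for illustration.
  $expected = array('Germplasm', 'Location', 'Year', 'Replicate', 'Trait', 'Value');
  $errors = array();

  if (empty($project_id)) {
    $errors[] = 'No project was selected.';
  }
  if (count($header) != count($expected)) {
    $errors[] = 'The file has the wrong number of columns.';
  }
  elseif ($header !== $expected) {
    $errors[] = 'The column headers do not match the required headers.';
  }

  return $errors;
}
```

If this returns an empty array, the upload form would then queue the extensive validation as a Tripal job.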
VALIDATION WITHOUT ERRORS.
A Next Step button is added to instruct the user to proceed. Consistent with Tripal admin pages, this button is left-aligned and outside the form element container/fieldset.
VALIDATION WITH ERRORS.
Similar to rawphenotypes, detected errors are listed, followed by a failed status message and a drag-and-drop area to allow the user to re-upload.
On this page, the admin is asked to fully describe all traits. A validation result window instructs the user to complete the forms and provides relevant details about the uploaded file. Each detected trait has a set of form field elements and a title organized into one fieldset/container. Traits are sequentially numbered, related form fields are grouped together, and the form for the first trait is uncollapsed on load, all to guide the user when filling out entries.
The user will be notified of the number of described and undescribed traits before proceeding.
Note: please confirm that a summary table is expected in the Stats section (from the specs: "Include a stats section calculated by us.").
Finally, my favourite stage :) is where the data gets processed and stored. It illustrates the completed stage indicator, a series of warnings and status messages, and finally a progress bar to show, among other things, the progress.
Let me know...
First off, I love the mockups! They are exactly what I was picturing. Also, I completely agree with the two-step validation: fast validation done on upload, line-by-line validation done in a Tripal job. Good solution!
Image below illustrates the flow of loading data to AP.
Revised pages of the loader (in the order of stages shown above):
01 UPLOAD
01 UPLOAD / NO VALIDATION ERRORS
01 UPLOAD / WITH VALIDATION ERRORS
02 VALIDATE
02 VALIDATE / NO VALIDATION ERRORS
02 VALIDATE / WITH VALIDATION ERRORS
03 DESCRIBE
04 SAVE
Thanks!
Looks perfect except for the "suggestions" in "03 DESCRIBE". I think a drop-down here is very confusing; this would be better shown as a list. I would expect something more along the lines of "Possible Crop Ontology Term(s): Plant Height, Canopy Height, First Node Height" if the trait name was "Height".
Questions/Clarifications:
Stage 2 - Validate:
Stage 3 - Describe: CVTERM NAME
PHOTO Suggestion: use the cvterm id plus a sequence number plus the file extension. Example: 2132_2.gif, where 2132 is the cvterm id and 2 indicates photo #2. This method does not require a table; we just need to remember the directory we are saving photos in. :)
Summary Table: Is the source of data the file or stored records, where site-year is the phenotype table's field location and year, min is the minimum value of the record set, max is the maximum value of the record set, mean is the sum of the values divided by the number of rows, and standard deviation (need to google this :))?
Saving Line
I hope I make sense and please let me know. Thanks!
Should the unique-combination validation check uniqueness within the file or against the database records?
It should be unique in the database.
When inserting a cvterm a DB is specified. What are the possible DBs and the default? When inserting a cvterm will the cv_id value be cv_id = phenotype_measurement_types or should we create one specific to this module (eg. analyzedphenotype_measurement_type)?
These should be configurable. More specifically, your module should create a settings form at Admin > Tripal > Extensions > Analyzed Phenotypes > Settings that allows the admin to select an existing cv and db per organism.genus. You should then use these ontologies when checking to see if a term already exists. Furthermore, the admin should be able to specify whether new terms can be added or not. Some sites will want to keep the ontologies pure, whereas others will want to build them as they go.
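As a rough illustration only, such a settings form could be built with the Drupal 7 Form API along the lines below. The function name, variable names, and form keys are all hypothetical; the real options would be loaded from chado.cv and chado.db.

```php
<?php
// Hypothetical sketch of the per-genus cv/db settings form (Drupal 7 Form API).
// None of these names are the module's real API.
function analyzedphenotypes_admin_settings($form, &$form_state) {
  // One cv (and similarly one db) selection per organism.genus.
  $form['ap_lens_cv'] = array(
    '#type' => 'select',
    '#title' => t('Trait controlled vocabulary for Lens'),
    '#options' => array(/* cv_id => cv.name, loaded from chado.cv */),
    '#default_value' => variable_get('ap_lens_cv', NULL),
  );
  // Whether new terms may be added to the selected ontologies.
  $form['ap_allow_new_terms'] = array(
    '#type' => 'checkbox',
    '#title' => t('Allow new terms to be added to the selected ontologies'),
    '#default_value' => variable_get('ap_allow_new_terms', FALSE),
  );
  return system_settings_form($form);
}
```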
When relating the term to the unit in CVTERM_RELATIONSHIP, will the type_id value be a term from cv_id = phenotype_measurement_units, or should we create one specific to this module (e.g. analyzedphenotypes_measurement_units)?
This should be stored the same way it is for the crop ontologies. However, I don't know what that is off the top of my head. I'll look into it and reply back later.
Could not figure out how to store :(
This should be stored the same way it is for the crop ontologies. However, I don't know what that is off the top of my head. I'll look into it and reply back later.
Looking at the kp_entities module, I think the ontology is stored as cvterms with a specific cv. The cvs in this case were LENTIL CROP ONTOLOGY, CHICKPEA CROP ONTOLOGY and so on (the default-namespace in the obo file), with a corresponding record in tripal_cv_obo.
There are actually multiple cvs created per crop ontology. For example, the Lentil Crop Ontology creates the following cvs: "Crop Ontology, Lentil Variable", "Crop Ontology, Lentil Scale", "Crop Ontology, Lentil Method", "Crop Ontology, Lentil Trait", "Lentil Crop Ontology". Traits are contained in the "Crop Ontology, Lentil Trait" cv.
What namespace/cv should we use for Crop Ontology?
This is dependent upon the organism.genus selected for the current data file. If the data file refers to "Lens" then we should use the "Crop Ontology, Lentil Trait" cv. This should also be configured in the same section as the cv/db per genus above. It also might be good to make this comparison optional, since some sites might decide to use the crop ontology directly rather than map to it.
What namespace/cv should we use for Plant Trait Ontology?
The Plant Trait ontology currently can't be loaded into Tripal due to an incompatibility in the OBO format. I would comment out this section for now.
Please confirm: should Ontology and Trait be related via CVTERM_RELATIONSHIP, where type_id is a term from the cv described above, or should we create a term specific to this module to describe the relationship (e.g. analyzedphenotype_measurement_ontology)? object_id = cvterm_id of the Trait; subject_id = cvterm_id of the Ontology.
Use the following term as the type_id: cvterm.name = related, cv.name = synonym_type. Your subject and object are correct :-)
Suggestion: use the cvterm id plus a sequence number plus the file extension. Example: 2132_2.gif, where 2132 is the cvterm id and 2 indicates photo #2. This method does not require a table; we just need to remember the directory we are saving photos in. :)
Sure, let's run with this :-) Just make sure the files are managed by Drupal.
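The agreed naming scheme can be sketched as a one-liner; the function name is illustrative only, and in the real module the result would be handed to Drupal's managed-file handling as requested above.

```php
<?php
// Build a photo filename from the cvterm id, a per-term sequence number,
// and the extension of the originally uploaded file (hypothetical helper).
function ap_photo_filename($cvterm_id, $sequence, $original_name) {
  $extension = pathinfo($original_name, PATHINFO_EXTENSION);
  return $cvterm_id . '_' . $sequence . '.' . $extension;
}

echo ap_photo_filename(2132, 2, 'plant-height.gif');  // prints 2132_2.gif
```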
As mentioned, when a trait is found we auto-fill the describe form with the corresponding values. When modified (even just a word in the description), it becomes a new record. Will this be true if the photos/ontology were changed?
No. If the photos or ontology mapping are changed then we can just update the current trait.
Is the source of data the file or stored records, where site-year is the phenotype table's field location and year, min is the minimum value of the record set, max is the maximum value of the record set, mean is the sum of the values divided by the number of rows, and standard deviation (need to google this :))?
The source data is from the file. You're correct on how to calculate min, max, mean. Standard deviation (how spread out the numbers are: https://www.mathsisfun.com/data/standard-deviation.html) can be calculated by adding the following to our module (create analyzedphenotypes/api/analyzedphenotypes.api.inc and include it in our .module file):
if (!function_exists('stats_standard_deviation')) {
  /**
   * This user-land implementation follows the implementation quite strictly;
   * it does not attempt to improve the code or algorithm in any way. It will
   * raise a warning if you have fewer than 2 values in your array, just like
   * the extension does (although as an E_USER_WARNING, not E_WARNING).
   *
   * @param array $a
   * @param bool $sample [optional] Defaults to false
   * @return float|bool The standard deviation or false on error.
   */
  function stats_standard_deviation(array $a, $sample = false) {
    $n = count($a);
    if ($n === 0) {
      trigger_error("The array has zero elements", E_USER_WARNING);
      return false;
    }
    if ($sample && $n === 1) {
      trigger_error("The array has only 1 element", E_USER_WARNING);
      return false;
    }
    $mean = array_sum($a) / $n;
    $carry = 0.0;
    foreach ($a as $val) {
      $d = ((double) $val) - $mean;
      $carry += $d * $d;
    }
    if ($sample) {
      --$n;
    }
    return sqrt($carry / $n);
  }
}
Source: http://php.net/manual/en/function.stats-standard-deviation.php#114473
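As a sanity check, the formula above can be verified on a small dataset whose population standard deviation is known exactly: [2, 4, 4, 4, 5, 5, 7, 9] has mean 5 and standard deviation 2. The steps below mirror the function's population form ($sample = false).

```php
<?php
// Worked example of the population standard deviation computed above.
$values = array(2, 4, 4, 4, 5, 5, 7, 9);
$mean = array_sum($values) / count($values);  // (40 / 8) = 5
$carry = 0.0;
foreach ($values as $v) {
  $carry += ($v - $mean) * ($v - $mean);      // sum of squared deviations = 32
}
$sd = sqrt($carry / count($values));          // sqrt(32 / 8) = sqrt(4)
echo $sd;  // prints 2
```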
uniquename = trait name
The uniquename has to be unique for the measurement. Therefore it should be a combination of trait_id, project_id, location, year, stock_id, and rep. Just to be safe I throw the date in there too when generating phenotypic data. See https://github.com/UofS-Pulse-Binfo/generate_trpdata/blob/7.x-3.x/generate_trpdata.drush.inc#L854.
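Concretely, the uniquename could be assembled from those identifying parts like the sketch below. The helper name, separator, and field order are illustrative; the linked generate_trpdata script is the reference for the pattern actually used there.

```php
<?php
// Sketch: build phenotype.uniquename from the measurement's identifying
// parts, plus a date for extra safety (hypothetical helper name).
function ap_phenotype_uniquename($trait_id, $project_id, $location, $year, $stock_id, $rep, $date) {
  return implode('_', array($trait_id, $project_id, $location, $year, $stock_id, $rep, $date));
}
```

Two measurements differing in any one part (e.g. replicate) then get distinct uniquenames, satisfying the per-measurement uniqueness requirement.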
For location, replicate, year and data collector, I believe I can only query Location and Replicate (as rep) from the rawphenotypes module. Should we add a separate copy of these four terms on install?
We want to use public ontologies as much as possible... However, in the interests of time, I stuck to terms that were already available with Tripal3. These are what I used for generating phenotypic data:
Replicate did need to be created (see https://github.com/UofS-Pulse-Binfo/generate_trpdata/blob/7.x-3.x/generate_trpdata.drush.inc#L732). These will work for now, but keep in mind they are not ideal. Perhaps it would be good to add a GitHub issue to find better public terms ;-).
I see that you created two new tables, ap_phenotype and ap_phenotypeprop, using hook_schema(). You will want to use chado.phenotype and chado.phenotypeprop instead, as these tables already exist :-) Unfortunately, the tables that come with Chado are missing a few columns, so your module will need to check whether the tables match your expectations and alter them if they don't. This should be done on module enable. How to make the changes if they're not already done:
chado_query('ALTER TABLE {phenotype} ADD COLUMN project_id integer REFERENCES {project} (project_id)');
chado_query('ALTER TABLE {phenotype} ADD COLUMN stock_id integer REFERENCES {stock} (stock_id)');
chado_query('ALTER TABLE {phenotypeprop} ADD COLUMN cvalue_id integer REFERENCES {cvterm} (cvterm_id)');
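The "check first, then alter" logic on module enable could be wired up roughly as below. This assumes Tripal's chado_column_exists() helper is available in the installed Tripal version; the hook implementation follows standard Drupal 7 conventions and is a sketch, not the module's actual code.

```php
<?php
/**
 * Implements hook_enable().
 *
 * Sketch: add the missing Chado columns only when they are not already there,
 * so re-enabling the module is safe.
 */
function analyzedphenotypes_enable() {
  if (!chado_column_exists('phenotype', 'project_id')) {
    chado_query('ALTER TABLE {phenotype} ADD COLUMN project_id integer REFERENCES {project} (project_id)');
  }
  if (!chado_column_exists('phenotype', 'stock_id')) {
    chado_query('ALTER TABLE {phenotype} ADD COLUMN stock_id integer REFERENCES {stock} (stock_id)');
  }
  if (!chado_column_exists('phenotypeprop', 'cvalue_id')) {
    chado_query('ALTER TABLE {phenotypeprop} ADD COLUMN cvalue_id integer REFERENCES {cvterm} (cvterm_id)');
  }
}
```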
Outstanding question: how should we relate the trait to its unit and scale? My answer is to follow the same method as the crop ontologies. However, I don't know what that is off the top of my head, so I'm adding this here with the intent of looking into it later.
Additionally, I've added functionality to display data as your trait distribution chart and a summary table. These require two materialized views which will need to be synced after new data is loaded. Your upload form should automatically submit a job to sync these two materialized views (mview_phenotype, mview_phenotype_summary).
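Queueing those sync jobs after a successful load might look like the sketch below. It assumes the Tripal 7.x job and materialized-view APIs (tripal_add_job(), tripal_get_mview_id(), tripal_populate_mview()); the exact function names should be verified against the installed Tripal version.

```php
<?php
// Sketch: after the data file is loaded, queue one job per materialized
// view so the chart and summary table reflect the new data.
global $user;
foreach (array('mview_phenotype', 'mview_phenotype_summary') as $mview_name) {
  $mview_id = tripal_get_mview_id($mview_name);
  tripal_add_job(
    "Populate materialized view: $mview_name",  // human-readable job name
    'analyzedphenotypes',                       // submitting module
    'tripal_populate_mview',                    // callback run by the job daemon
    array($mview_id),
    $user->uid
  );
}
```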
Analyzed Phenotypes Data Downloader
Mockup #1
This mockup shows a download page similar to the rawphenotypes download page. The top section, preceding the main title, is a set of informative icons, each representing a relevant type of data or filter available to the user. When one is selected, a series of form elements, populated with more detailed filters, allows for more customized refinement. Retrieval of the entire dataset is also supported by clicking the all-dataset option.
All textarea form elements are multi-select and have, as their first option, an include/select-all choice.
Notes
The mockup looks beautiful :-)
I suggest grouping by category to provide similar functionality to your "select by icon" while still showing all filter criteria. You might want to use the genotype filter as an example: http://knowpulse.usask.ca/portal/chado/genotype/Lens.
@carolyncaron any other filter criteria suggestions? What are your thoughts on what the exported file should look like?
Mockup #2
This mockup shows the data downloader with the minimum set of filter options. Below, it is shown with all filters visible.
I have merged germplasm accession and name into one field; the user can type in either the accession or the name.
Need more information on Allowed Missing data.
Thanks!
Looks good :-)
Think of the dataset as a table where each row is a specific germplasm and each column is a location/year combination (site-year). This filter says not to export germplasm which have more than a given number of columns missing. For example, if the filter is set to 20% and there are 10 site-years in the exported dataset, then only 2 columns per row can be empty. Any germplasm with more than that should not be added to the downloaded file. Thus this filter is easiest to apply while building the file (not in the select query).
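Applied while building the file, that per-row check reduces to the sketch below. The function name is illustrative; $row holds one germplasm's site-year values and empty strings/NULLs count as missing.

```php
<?php
// Sketch: keep a germplasm row only when the fraction of missing site-year
// values does not exceed the allowed fraction (e.g. 0.2 for 20%).
function ap_row_passes_missing_filter(array $row, $max_missing_fraction) {
  $total = count($row);
  $missing = 0;
  foreach ($row as $value) {
    if ($value === '' || $value === NULL) {
      $missing++;
    }
  }
  return ($missing / $total) <= $max_missing_fraction;
}
```

With 10 site-years and a 20% threshold, a row with 2 empty columns is kept and a row with 3 is dropped, matching the example above.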
I'm loving how Mockup 2 looks 👍
Regarding trait selection, I don't think we want to restrict the user to only select one. My initial concern was that someone may select ALL traits by default, which we definitely want to discourage for the reasons you mention, @laceysanderson. I think we still want the multi-select but to remove the "All Traits" option. While this doesn't prevent the user from selecting every single trait, it would require more work on their part and thus they are less likely to do it. ;-) My reasons for allowing multiple traits are entirely based on my experience with TASSEL and GAPIT, which I think inevitably will be part of downstream analysis for many users of this module.
It occurred to me that my reasons are based heavily on experimental design. Perhaps if we are really concerned about users downloading too many traits, we can somehow limit them to only download all traits within an experiment. So, if they don't choose an experiment, they can select one trait. Otherwise, they get the multi-select option for traits limited to their selected experiment. What do you think?
I'm sorry to say it, but I don't think we should provide an option to filter for missing data. :-( I think that initially, when we have a few experiments based on using mostly the same germplasm, the filter for maximum allowed missing data could be useful. But over time, I can see there being issues with something like this. For example, assume that we measure plant height for Redberry at every site-year from now for the next five years. In five years, we continue to take plant height for Redberry but also for a new variety that was just released. If you let a few years pass, you now have a substantial number of measurements for Redberry, and a much smaller but still reasonable number of measurements for the new variety. Someone may select 50% missing data allowed and filter out the new variety as a result. It becomes very difficult to offer filter options based on stats with an ever-expanding database. Filtering % missing data using our VCF_filter module works well because it is restricted to individual files.
I also think it is very important that if a researcher has requested phenotypic data for specific germplasm (either by specifying an experiment, or multi-selecting germplasm names) then we shouldn't provide an additional filter that could potentially filter those germplasm out. I would opt to allow the researcher to do their own filtering based on statistics for their specified dataset (and hopefully they will do this using something like R!).
I think what we have now is a good amount to start with. It's hard to think of additional filters that aren't stats-based. @laceysanderson already pointed this out through chat, and allowing the user to select whether they want replicates (if they have permission to) is a great addition. 👍
I still really like the R-friendly format option in raw phenotypes, and I would appreciate seeing something like that here. Other than that, as far as what columns to include in the file, it is really tough to say since some of them will definitely appear redundant depending on filtering criteria. I always lean towards providing the most information we can, so that it is then up to the user to remove what they don't need, if necessary. @reynoldtan's suggestion in chat to allow the user to specify columns could be a great compromise to this, however! Perhaps all columns could be selected by default?
I was thinking that it might be worthwhile if Reynold and I set up a meeting time with Derek and/or other phenotype analysts to get their feedback on filtering criteria/export formats. I'm sure they will have a pretty good idea of what else they'd like to see (or, at least, what they don't like, lol).
Mockup #3
Click the link below to preview the header picker. https://myfiddle-reynoldltan.c9users.io/
Thanks!
Addressed through multiple PRs.
The Global Trust data needs to be available for access as soon as possible. As such, we need to make a first release of this module with extremely basic functionality. To meet the need for access, at a minimum we need an upload page for advanced users to submit data and a download page for public users to access data. I will add more specific details for each to this issue.