NCEAS / metacat

Data repository software that helps researchers preserve, share, and discover data
https://knb.ecoinformatics.org/software/metacat
GNU General Public License v2.0

[Bug] Authors missing from citations #1374

Open helbashandy opened 5 years ago

helbashandy commented 5 years ago

If two authors were added to a dataset with the same first and last name but different organizations, the citation will only include the name of one of them.

Please take a look at this example dataset: https://data-stage.ess-dive.lbl.gov/view/ess-dive-673951be334c80e-20190807T155749903

You'll find the author test test5 listed twice, with different organizations, but only once in the citation.

When I looked at the Solr query result, I found that the name appeared only once in the origin field: "origin":["test test1","test test2","test test3","test4 test4","test test5","test test6","test test7","test test8","test test9","test test10"]

mbjones commented 5 years ago

@HeshamElbashandy I think the field in the SOLR index is multi-valued and only keeps each unique value once. Do you have a real-world use case (not made-up test data) where two authors on the same data package share the exact same name? While there are many Matthew Joneses in the world, I've yet to coauthor with another!

charuleka commented 5 years ago

@mbjones The most likely case for now is when projects upload data through the API and, in some cases, provide only initials or shortened names rather than the full first name. Incomplete first names were an issue for quite a few datasets imported from CDIAC, but we haven't done a comprehensive search to figure out whether any authors have been left off citations because of this bug.

mbjones commented 5 years ago

Thanks for the clarification.

charuleka commented 5 years ago

Given that we haven't had someone spot this so far, this can be set to a medium priority bug.

helbashandy commented 4 years ago

Hi @mbjones - The authorLastName field was missing a last name because it returned only the unique last names. This broke our citation for this specific dataset, where we had two creators with the same last name.

Dataset: https://data.ess-dive.lbl.gov/view/ess-dive-fe19bd3fbd441bc-20200526T210150114

taojing2002 commented 4 years ago

@helbashandy Could you take a screenshot? I can't read the document.

helbashandy commented 4 years ago

@taojing2002, I added your ORCID as an admin. Let me know if you still can't see it.

csjx commented 4 years ago

@taojing2002 - We need to have this released as a patch release for 2.12.x as well as the newest release (2.14) since ESSDIVE won't be able to move to 2.13 for a while, so let's plan on two releases.

amoeba commented 4 years ago

Hey @helbashandy, we'd like to get this figured out. Sorry to ask you to add yet another subject to that group but could you please add http://orcid.org/0000-0002-0381-3766 so I can have a look? Alternatively, sharing the EML doc or docs in question is nearly as good.

Re: "This broke our citation for this specific dataset where we had two creators with the same lastName."

Could you describe or point at code where you generate your citation(s)? A stock install of MetacatUI generates the HTML template from what's in the origin field but it sounds like you may be using authorLastName and possibly other fields. Is that right?

helbashandy commented 4 years ago

Hi @amoeba, I just added your subject to the ess-dive-admins group.

The dataset in question is: https://data.ess-dive.lbl.gov/view/ess-dive-fe19bd3fbd441bc-20200526T210150114

The error happens because of how we build our citations: we index-match each contributor's first name from the origin field with its last name from the authorLastName field. If the two lists don't line up, we can't build the citation properly.
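
To illustrate the index-matching described above, here is a minimal Python sketch. It is not our production code: the function name `pair_names` is hypothetical, and it assumes that each origin value has the form "givenName surName" and that both Solr fields return their values in creator order. The names are taken from this dataset's EML.

```python
def pair_names(origin, author_last_names):
    """Index-match origin entries with last names; return (given, last) tuples.

    Assumption: origin[i] is "<given names> <last name>" and
    author_last_names[i] is the last name of the same creator.
    """
    if len(origin) != len(author_last_names):
        raise ValueError("origin and authorLastName differ in length")
    pairs = []
    for full_name, last_name in zip(origin, author_last_names):
        if not full_name.endswith(last_name):
            raise ValueError(f"{full_name!r} does not end with {last_name!r}")
        # Everything before the last name is treated as the given name(s).
        given = full_name[: len(full_name) - len(last_name)].strip()
        pairs.append((given, last_name))
    return pairs


origin = ["Maceo Hastings Porro", "C. F. Rick Williams", "Kenneth Williams"]
complete_last_names = ["Hastings Porro", "Williams", "Williams"]
deduplicated_last_names = ["Hastings Porro", "Williams"]  # second Williams dropped

print(pair_names(origin, complete_last_names))

try:
    pair_names(origin, deduplicated_last_names)
except ValueError as err:
    print("cannot build citation:", err)
```

With the complete list, the multi-word last name "Hastings Porro" is recovered correctly. With the de-duplicated list, the lengths no longer agree and the matching cannot proceed.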

The reason we don't take the first and last names directly from the origin field is that we can't tell whether the last name has a preposition or consists of two or more words, and our citation format requires the last name to be listed.

We made a workaround for it: when the authorLastName and origin lists differ in length, we fall back to the origin field. This is limited in functionality, because when two authors share a last name it cannot recover prepositioned last names or last names with multiple words. It's in the current release candidate and not deployed to production yet. Ideally, we'd build the citation from authorLastName directly.
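
A rough Python sketch of that fallback, under stated assumptions: the function name `citation_names` is hypothetical, the fallback is assumed to split each origin value at its last space, and this is not the code in our release candidate.

```python
def citation_names(origin, author_last_names):
    """Return (given, last) tuples, falling back to origin on a length mismatch."""
    if len(origin) == len(author_last_names):
        # Preferred path: trust authorLastName and strip it from the origin entry.
        return [
            (full[: len(full) - len(last)].strip(), last)
            for full, last in zip(origin, author_last_names)
        ]
    # Fallback path (assumption): treat the final word of each origin entry
    # as the last name. Multi-word or prepositioned last names are lost here.
    names = []
    for full in origin:
        given, _, last = full.rpartition(" ")
        names.append((given, last))
    return names


origin = ["Maceo Hastings Porro", "C. F. Rick Williams", "Kenneth Williams"]

# authorLastName as returned when the duplicate "Williams" is dropped
print(citation_names(origin, ["Hastings Porro", "Williams"]))
```

In the fallback path every author is still listed, but "Hastings Porro" is cut down to "Porro", which is the limitation described above.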

Here's the dataset's EML:

<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd" packageId="ess-dive-fe19bd3fbd441bc-20200526T210150114" system="ess-dive">
    <dataset>
        <title>Locations, metadata, and species cover from field sampling survey associated with NEON AOP survey, East River, CO 2018</title>
        <creator id="5824075189483349">
            <individualName>
                <givenName>K. Dana</givenName>
                <surName>Chadwick</surName>
            </individualName>
            <organizationName>Stanford University</organizationName>
            <electronicMailAddress>kdc@stanford.edu</electronicMailAddress>
            <userId directory="https://orcid.org">https://orcid.org/0000-0002-5633-4865</userId>
        </creator>
        <creator id="7302742833026817">
            <individualName>
                <givenName>Kathleen</givenName>
                <surName>Grant</surName>
            </individualName>
            <organizationName>Stanford University</organizationName>
            <electronicMailAddress>kdennist@usc.edu</electronicMailAddress>
            <userId directory="https://orcid.org">https://orcid.org/0000-0002-0219-1758</userId>
        </creator>
        <creator id="2183100484924005">
            <individualName>
                <givenName>Amanda</givenName>
                <surName>Henderson</surName>
            </individualName>
            <organizationName>Rocky Mountain Biological Laboratory</organizationName>
            <electronicMailAddress>amanda.henderson3@gmail.com</electronicMailAddress>
            <userId directory="https://orcid.org">https://orcid.org/0000-0001-9613-5003</userId>
        </creator>
        <creator id="3481778203388660">
            <individualName>
                <givenName>Ian</givenName>
                <surName>Breckheimer</surName>
            </individualName>
            <organizationName>Rocky Mountain Biological Laboratory</organizationName>
            <electronicMailAddress>ikb@rmbl.org</electronicMailAddress>
        </creator>
        <creator id="7298594783291651">
            <individualName>
                <givenName>C. F. Rick</givenName>
                <surName>Williams</surName>
            </individualName>
            <organizationName>Rocky Mountain Biological Laboratory</organizationName>
            <electronicMailAddress>willcha2@isu.edu</electronicMailAddress>
            <userId directory="https://orcid.org">https://orcid.org/0000-0003-4138-308X</userId>
        </creator>
        <creator id="4245804751109434">
            <individualName>
                <givenName>Nicola</givenName>
                <surName>Falco</surName>
            </individualName>
            <organizationName>Lawrence Berkeley National Laboratory</organizationName>
            <electronicMailAddress>nicolafalco@lbl.gov</electronicMailAddress>
            <userId directory="https://orcid.org">https://orcid.org/0000-0003-3307-6098</userId>
        </creator>
        <creator id="5545889295197919">
            <individualName>
                <givenName>Jaincong</givenName>
                <surName>Chen</surName>
            </individualName>
            <organizationName>Lawrence Berkeley National Laboratory</organizationName>
            <electronicMailAddress>jiancongchen@lbl.gov</electronicMailAddress>
        </creator>
        <creator id="8189473150680316">
            <individualName>
                <givenName>Hilary</givenName>
                <surName>Henry</surName>
            </individualName>
            <organizationName>University of California  Berkeley</organizationName>
            <electronicMailAddress>hilary_henry@berkeley.edu</electronicMailAddress>
            <userId directory="https://orcid.org">https://orcid.org/0000-0003-3486-140X</userId>
        </creator>
        <creator id="4020370407286974">
            <individualName>
                <givenName>Aizah</givenName>
                <surName>Khurram</surName>
            </individualName>
            <organizationName>Lawrence Berkeley National Laboratory</organizationName>
            <electronicMailAddress>akhurram@lbl.gov</electronicMailAddress>
        </creator>
        <creator id="7815053845632704">
            <individualName>
                <givenName>Jack</givenName>
                <surName>Lamb</surName>
            </individualName>
            <organizationName>Lawrence Berkeley National Laboratory</organizationName>
            <electronicMailAddress>jlamb@lbl.gov</electronicMailAddress>
            <userId directory="https://orcid.org">https://orcid.org/0000-0002-6584-7697</userId>
        </creator>
        <creator id="2651111603090834">
            <individualName>
                <givenName>Maeve</givenName>
                <surName>McCormick</surName>
            </individualName>
            <organizationName>Stanford University</organizationName>
            <electronicMailAddress>mmccorm2@alumni.stanford.edu</electronicMailAddress>
            <userId directory="https://orcid.org">https://orcid.org/0000-0002-3275-8663</userId>
        </creator>
        <creator id="6705884109676126">
            <individualName>
                <givenName>Hailee</givenName>
                <surName>McOmber</surName>
            </individualName>
            <organizationName>Fort Lewis College</organizationName>
            <electronicMailAddress>hbmcomber@gmail.com</electronicMailAddress>
        </creator>
        <creator id="9409751840977876">
            <individualName>
                <givenName>Samuel</givenName>
                <surName>Pierce</surName>
            </individualName>
            <organizationName>Stanford Linear Accelerator Center - National Accelerator Laboratory</organizationName>
            <electronicMailAddress>swpierce@stanford.edu</electronicMailAddress>
            <userId directory="https://orcid.org">https://orcid.org/0000-0001-7614-0227</userId>
        </creator>
        <creator id="7458421606901447">
            <individualName>
                <givenName>Alexander</givenName>
                <surName>Polussa</surName>
            </individualName>
            <organizationName>Lawrence Berkeley National Laboratory</organizationName>
            <electronicMailAddress>alexander.polussa@yale.edu</electronicMailAddress>
        </creator>
        <creator id="5048998533653692">
            <individualName>
                <givenName>Maceo</givenName>
                <surName>Hastings Porro</surName>
            </individualName>
            <organizationName>Stanford University</organizationName>
            <electronicMailAddress>maceo1995@gmail.com</electronicMailAddress>
            <userId directory="https://orcid.org">https://orcid.org/0000-0003-1587-8513</userId>
        </creator>
        <creator id="5216114848404514">
            <individualName>
                <givenName>Andea</givenName>
                <surName>Scott</surName>
            </individualName>
            <organizationName>Stanford University</organizationName>
            <electronicMailAddress>andea98@stanford.edu</electronicMailAddress>
            <userId directory="https://orcid.org">https://orcid.org/0000-0003-4016-7422</userId>
        </creator>
        <creator id="8138276137609632">
            <individualName>
                <givenName>Hans</givenName>
                <surName>Wu Singh</surName>
            </individualName>
            <organizationName>Lawrence Berkeley National Laboratory</organizationName>
            <electronicMailAddress>hwsingh@ucsd.edu</electronicMailAddress>
        </creator>
        <creator id="2377476364589170">
            <individualName>
                <givenName>Bizuayehu</givenName>
                <surName>Whitney</surName>
            </individualName>
            <organizationName>Lawrence Berkeley National Laboratory</organizationName>
            <electronicMailAddress>btw31@berkeley.edu</electronicMailAddress>
            <userId directory="https://orcid.org">https://orcid.org/0000-0001-5670-5411</userId>
        </creator>
        <creator id="2799864552081211">
            <individualName>
                <givenName>Eoin</givenName>
                <surName>Brodie</surName>
            </individualName>
            <organizationName>Lawrence Berkeley National Laboratory</organizationName>
            <electronicMailAddress>elbrodie@lbl.gov</electronicMailAddress>
        </creator>
        <creator id="3980441243926453">
            <individualName>
                <givenName>Rosemary</givenName>
                <surName>Carroll</surName>
            </individualName>
            <organizationName>Desert Research Institute</organizationName>
            <electronicMailAddress>rosemary.carroll@dri.edu</electronicMailAddress>
        </creator>
        <creator id="7588379448105595">
            <individualName>
                <givenName>Christian</givenName>
                <surName>Dewey</surName>
            </individualName>
            <organizationName>Stanford University</organizationName>
            <electronicMailAddress>cwdewey@stanford.edu</electronicMailAddress>
            <userId directory="https://orcid.org">https://orcid.org/0000-0003-1954-8298</userId>
        </creator>
        <creator id="1731924118756914">
            <individualName>
                <givenName>Lara</givenName>
                <surName>Kueppers</surName>
            </individualName>
            <organizationName>University of California  Berkeley</organizationName>
            <electronicMailAddress>lmkueppers@lbl.gov</electronicMailAddress>
        </creator>
        <creator id="1512116637115038">
            <individualName>
                <givenName>Taylor</givenName>
                <surName>Maavara</surName>
            </individualName>
            <organizationName>Lawrence Berkeley National Laboratory</organizationName>
            <electronicMailAddress>taylor.maavara@yale.edu</electronicMailAddress>
        </creator>
        <creator id="4877392558614621">
            <individualName>
                <givenName>Heidi</givenName>
                <surName>Steltzer</surName>
            </individualName>
            <organizationName>Fort Lewis College</organizationName>
            <electronicMailAddress>Steltzer_H@fortlewis.edu</electronicMailAddress>
        </creator>
        <creator id="5742111321836169">
            <individualName>
                <givenName>Kenneth</givenName>
                <surName>Williams</surName>
            </individualName>
            <organizationName>Lawrence Berkeley National Laboratory</organizationName>
            <electronicMailAddress>khwilliams@lbl.gov</electronicMailAddress>
        </creator>
        <creator id="4905906110566510">
            <individualName>
                <givenName>Katherine</givenName>
                <surName>Maher</surName>
            </individualName>
            <organizationName>Stanford University</organizationName>
            <electronicMailAddress>kmaher@stanford.edu</electronicMailAddress>
            <userId directory="https://orcid.org">https://orcid.org/0000-0002-5982-6064</userId>
        </creator>
        <associatedParty id="8222482607504639">
            <organizationName>U.S. DOE &#x3E; Office of Science &#x3E; Biological and Environmental Research (BER)</organizationName>
            <userId directory="unknown">http://dx.doi.org/10.13039/100006206</userId>
            <role>fundingOrganization</role>
        </associatedParty>
        <associatedParty id="3023418096983174">
            <organizationName>NSF EAR Postdoctoral Fellowship, Chadwick, ID: 1725788</organizationName>
            <role>fundingOrganization</role>
        </associatedParty>
        <associatedParty id="3059548327922574">
            <organizationName>U.S. Department of Energy BER award, PI: Maher, DE-SC0018155 (PI: Maher)</organizationName>
            <role>fundingOrganization</role>
        </associatedParty>
        <associatedParty id="7050320276762481">
            <organizationName>LBNL SFA 2.0</organizationName>
            <role>fundingOrganization</role>
        </associatedParty>
        <associatedParty id="8775039284635195">
            <individualName>
                <givenName>Thomas</givenName>
                <surName>Powell</surName>
            </individualName>
            <organizationName>Lawrence Berkeley National Laboratory</organizationName>
            <electronicMailAddress>tlpowell@lbl.gov</electronicMailAddress>
            <role>contributor</role>
        </associatedParty>
        <pubDate>2020</pubDate>
        <abstract>
            <para>Locations and descriptions of the sites where field sampling was conducted during the 2018 National Ecological Observatory Network (NEON) Airborne Observation Platform (AOP) imaging spectroscopy and lidar surveys in Gunnison County, Colorado. The sampling sites were located across East River, Washington Gulch, Slate River, and Coal Creek watersheds and contained a mixture of meadow, shrub, and tree sampling sites. This data package contains the location, metadata, species composition, and vector files for the sampling sites in this project. Vector files provided here are for areas delineated based on the AOP reflectance data and mosaic developed by Brodrick et al. (doi: 10.15485/1618131), and use with other datasets should be closely evaluated. Subsequent data packages will contain additional biogeochemical, microbial, and geophysical data as they become available for these sampling sites. For full documentation, please see associated reference and additional references will be added here as they become available.</para>
        </abstract>
        <keywordSet>
            <keyword>Species cover</keyword>
            <keyword>vegetation height</keyword>
            <keywordThesaurus>CATEGORICAL:NONE</keywordThesaurus>
        </keywordSet>
        <keywordSet>
            <keyword>EARTH SCIENCE &#x3E; BIOSPHERE &#x3E; VEGETATION</keyword>
            <keywordThesaurus>CATEGORICAL:GCMD</keywordThesaurus>
        </keywordSet>
        <keywordSet>
            <keyword>EARTH SCIENCE &#x3E; BIOSPHERE &#x3E; VEGETATION &#x3E; CROWN</keyword>
            <keyword>EARTH SCIENCE &#x3E; BIOSPHERE &#x3E; VEGETATION &#x3E; VEGETATION COVER</keyword>
            <keyword>EARTH SCIENCE &#x3E; BIOSPHERE &#x3E; VEGETATION &#x3E; VEGETATION SPECIES</keyword>
            <keyword>EARTH SCIENCE &#x3E; LAND SURFACE &#x3E; SOILS &#x3E; SOIL MOISTURE/WATER CONTENT</keyword>
            <keywordThesaurus>VARIABLE:GCMD</keywordThesaurus>
        </keywordSet>
        <keywordSet>
            <keyword>vegetation_area_fraction</keyword>
            <keyword>canopy_height</keyword>
            <keywordThesaurus>VARIABLE:CF</keywordThesaurus>
        </keywordSet>
        <additionalInfo>
            <section>
                <title>Related References</title>
                <para>Chadwick, et al, Integrating airborne remote sensing and field campaigns for ecology and Earth system science. In review</para>
                <para>Brodrick P; Goulden T; Chadwick KD (2020): Custom NEON AOP reflectance mosaics and maps of shade masks, canopy water content. Watershed Function SFA. DOI:&#xA0;10.15485/1618131
</para>
                <para>Chadwick KD; Grant K; Henderson A; Scott A; McCormick M; Pierce S; Hastings Porro M; Maher K (2020): Leaf mass per area and leaf water content measurements from field survey in association with NEON AOP survey, East River, CO 2018. A Multiscale Approach to Modeling Carbon and Nitrogen Cycling within a High Elevation Watershed.&#xA0;DOI:&#xA0;10.15485/1618132</para>
            </section>
        </additionalInfo>
        <intellectualRights>
            <para>This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.</para>
        </intellectualRights>
        <coverage>
            <temporalCoverage>
                <rangeOfDates>
                    <beginDate>
                        <calendarDate>2018-06-14</calendarDate>
                    </beginDate>
                    <endDate>
                        <calendarDate>2018-07-30</calendarDate>
                    </endDate>
                </rangeOfDates>
            </temporalCoverage>
            <geographicCoverage>
                <geographicDescription>Upper east river watersheds including, East River, Coal Creek, Washington Gulch, and Slate River</geographicDescription>
                <boundingCoordinates>
                    <westBoundingCoordinate>-107.1</westBoundingCoordinate>
                    <eastBoundingCoordinate>-106.8</eastBoundingCoordinate>
                    <northBoundingCoordinate>39</northBoundingCoordinate>
                    <southBoundingCoordinate>38.8</southBoundingCoordinate>
                </boundingCoordinates>
            </geographicCoverage>
        </coverage>
        <contact id="9856233758930376">
            <individualName>
                <givenName>K. Dana</givenName>
                <surName>Chadwick</surName>
            </individualName>
            <organizationName>Stanford University</organizationName>
            <electronicMailAddress>kdc@stanford.edu</electronicMailAddress>
            <userId directory="https://orcid.org">https://orcid.org/0000-0002-5633-4865</userId>
        </contact>
        <publisher id="9539916364462432">
            <organizationName>Watershed Function SFA</organizationName>
        </publisher>
        <methods>
            <methodStep>
                <description>
                    <para>We designed an integrated sampling campaign to collect vegetation, soil, microbial, and geophysical data at co-located sampling sites to coincide with the timing of the AOP survey. We planned to collect samples within 72 hours of overflight, and without intervening rainfall, to the extent possible, given the rapid and sometimes unknown rates of change in processes of interest. The number of sampling sites was determined based on the amount of canopy samples desirable for sufficient ground truth data for developing statistical models to map foliar traits. Based on sample sizes in successful canopy-scale trait modeling projects (Singh, Serbin, McNeil, Kingdon, &#x26; Townsend, 2015; K. Chadwick &#x26; Asner, 2016; Martin et al., 2018), as well as previous modeling efforts (G. Asner &#x26; Martin, 2008), we aimed to collect samples from 400 sites across the 330 km2 area: 200 meadows, 100 trees, and 100 shrubs. To achieve this, we planned for collections to take 10 days within the duration of the AOP survey, with daily collections of 20 1x1 m meadow plots, 10 trees, and 10 shrubs (total of 40 ‘sites’) per day within a sampling area.</para>
                </description>
            </methodStep>
            <methodStep>
                <description>
                    <para>We organized a field team of 32 individuals (15-20 per day) to participate in the field collections and process samples as needed for preservation at the Rocky Mountain Biological Laboratory (RMBL). In order to standardize the sampling procedures for each area and account for turnover in field team members, we conducted training in advance, distributed detailed protocols and flowcharts to each sampling team, and scheduled people to maintain continuity of at least one team member. We planned for four teams to conduct physical sample collection each day, two sampling meadow plots, and one each for tree and shrub individuals. In addition to the physical sample collection, we planned for one team to rotate across the sampling area, visiting all sampling sites to measure soil moisture and depth, and collect high-accuracy GPS data using a Real-Time Kinematic (RTK) enabled GPS receiver.</para>
                </description>
            </methodStep>
            <methodStep>
                <description>
                    <para>In order to ensure the sampled vegetation was not shaded by topography, clouds, or adjacent vegetation during the time of overflights, the AOP team provided preliminarily processed data within 24-36 hours of the day’s survey. These data included a coarse orthorectification and processing of radiance data to four-band, red, green, blue, and near IR images. Our teams were equipped with iPads with these data loaded onto them to enable notation of approximate locations of sampling and photograph the site, to augment the cm-scale accuracy GPS location collected independently. When selecting individual sampling sites at the day's sampling area, each team chose plots or individuals with high vegetative cover and homogeneous characteristics relative to the immediately surrounding area. In addition, within a sampling area we selected a set of sites that represented a range of species and species assemblages. We collected samples from 437 sample sites within 72 hours of the AOP survey across 12 study areas within the sampling domain. These collections took place over the course of three weeks in June of 2018. In addition, we collected follow up samples from 40 sites at an additional sampling area, as well as with return trips, within the two subsequent weeks from conifers and riparian willows, which were less susceptible to the rapid phenological change experienced by meadow systems. In all, this resulted in 487 foliar samples and sites with documented species cover, provided here.</para>
                </description>
            </methodStep>
            <methodStep>
                <description>
                    <para>For tree and shrub sites, the species was recorded as 100% cover and a voucher was taken. For meadow sites, there were the following additional documentation requirements. Once the area was selected, the team took a maximum and median vegetation height measurement for the plot. The quad was placed onto the plot and the fractional cover of all species that make up &#x3E; 5% of the area were recorded, as well as the fraction of soil and litter visible from above the plot. For species that did not meet the 5% cover threshold, the cover was documented by functional type. For any species that were not identifiable by the field team, a voucher sample was taken from outside the plot for identification at RMBL herbarium.</para>
                    <para>We collected voucher specimens to validate and document field identifications or identify unknown samples of trees, shrubs and herbaceous vegetation in meadow plots. We collected reproductive individuals (flowers and/or fruits) when possible to facilitate identification. Specimens were placed in a field press, noting on newspaper: date, collector, site number and species letter identifier (e.g., unknown A, etc.) corresponding to its designation in field notes or percent cover estimation. Upon returning to herbarium, the plant press was placed on the drier for 2-3 days to completely dry the samples. Samples were identified using regional keys and manuals, and by comparison with specimens in the herbarium. In general, plant names follow the taxonomy of Weber and Wittmann (2012), although numerous sources were used for identification, including:  Colorado Flora: Western Slope (Weber &#x26; Wittmann, 2012) , Flora of Colorado (Ackerfield, 2015), Intermountain Flora (Cronquist, Holmgren, Holmgren, Reveal, &#x26; Holmgren, 1977) , Grasses of Colorado (Shaw, 2008), and Sedges of Colorado (Wingate, 2017). Voucher samples will be made into herbarium specimens and deposited in the RMBL herbarium where they will be digitized and made available online through the Consortium of Southern Rocky Mountain Herbaria (www.soroherbaria.org ), and iDigBio (www.idigbio.org ).</para>
                </description>
            </methodStep>
            <methodStep>
                <description>
                    <para>With the combined use of the high-accuracy GPS data and the GPS points recorded in the iPad software based on the preliminary imagery that we received from the AOP team, we generated polygons that represent the spatial extent of each sampling site. For the meadow plots, we collected GPS points from each corner, and we defined a polygon for the plot based on the pixels that had the most spatial coverage within the corner points. For shrubs and trees, we outlined the extent of the crown for each individual that was sampled. This method allows for the selection of all pixels that are associated with the individual that was sampled, rather than selecting an arbitrary distance from the GPS point that was collected. Utilizing an estimated crown diameter to automatically define pixels is problematic because of the non-uniform nature of tree canopies. This procedure allowed us to circumvent the challenging problem of absolute geospatial accuracy of both field and aircraft data by utilizing field and expert judgement to determine relative alignment between data sets and identify pixels for extraction. Please note that these shapefiles are specific to the VSWIR data collected by the NEON AOP in 2018 (Goulden et al, 2020) and specifically, the atmospheric correction and mosaicing procedure described in the reference paper: Chadwick et al., with a DOI: Brodrick et al. The precise location of these shapefiles may be offset from other data products and should be evaluated accordingly.</para>
                </description>
            </methodStep>
        </methods>
        <project>
            <title>Watershed Function SFA</title>
            <personnel id="6439498454274350">
                <individualName>
                    <givenName>Susan</givenName>
                    <surName>Hubbard</surName>
                </individualName>
                <organizationName>Lawrence Berkeley National Laboratory</organizationName>
                <electronicMailAddress>sshubbard@lbl.gov</electronicMailAddress>
                <role>principalInvestigator</role>
            </personnel>
        </project>
        <otherEntity id="ess-dive-401ed9dd8cbdd7d-20200518T231623498">
            <entityName>CRBU2018_AOP_Crowns.geojson</entityName>
            <entityType>application/octet-stream</entityType>
        </otherEntity>
        <otherEntity id="ess-dive-e152013275b5b57-20200518T231623518">
            <entityName>metadata_column_key.csv</entityName>
            <entityType>text/csv</entityType>
        </otherEntity>
        <otherEntity id="ess-dive-8d4d3e2299ca6a4-20200518T231623523">
            <entityName>raw_rtk_gps_points.csv</entityName>
            <entityType>text/csv</entityType>
        </otherEntity>
        <otherEntity id="ess-dive-976b3275a7932fd-20200518T231623538">
            <entityName>species_list.csv</entityName>
            <entityType>text/csv</entityType>
        </otherEntity>
        <otherEntity id="ess-dive-fb0203b86b1e2bb-20200519T192123487">
            <entityName>fractional_cover.csv</entityName>
            <entityType>text/csv</entityType>
        </otherEntity>
        <otherEntity id="ess-dive-16470e737c3ad18-20200519T192123496">
            <entityName>sample_site.csv</entityName>
            <entityType>text/csv</entityType>
        </otherEntity>
        <otherEntity id="ess-dive-1dd326184995adf-20200519T192141100">
            <entityName>sampling_area.csv</entityName>
            <entityType>text/csv</entityType>
        </otherEntity>
    </dataset>
</eml:eml>

amoeba commented 4 years ago

Hey @helbashandy, thanks for the info there.

As far as I can tell, @mbjones is right on here: fields like authorLastName are multi-valued, which is good, but they only store unique values. This is a property of the StrField type in Solr and its indexed property. Turning off indexing on a field makes it so all values, regardless of uniqueness, are returned. The big, glaring downside is that indexing is required for searching against a field, so disabling indexing for a field makes it no longer searchable (just returnable).

I'm not sure if we want to change anything in the Metacat default schema but if you're willing to keep a customized schema, I think the way to go would be to:

Note: For some odd reason, I found simply setting indexed=false on the field didn't work and that I had to make a new fieldType. :shrug:
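
A minimal sketch of what such a customized schema.xml entry might look like, assuming the behavior described above holds. The fieldType name `string_unindexed` is hypothetical, and the attributes on the authorLastName field are assumptions that should be checked against the schema.xml actually shipped with your Metacat version:

```xml
<!-- Hypothetical field type: stored but not indexed, so values are returnable but not searchable. -->
<fieldType name="string_unindexed" class="solr.StrField" indexed="false" stored="true"/>

<!-- Assumed field definition: point the existing field at the new, non-indexed type. -->
<field name="authorLastName" type="string_unindexed" indexed="false" stored="true" multiValued="true"/>
```

With this in place, authorLastName could still be returned in query results but could no longer be used in search or filter queries.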

I had trouble getting this change picked up with a simple reload of the core from the Solr admin UI, so I ended up restarting the Solr service (and Tomcat for good measure), and the change got picked up for new index tasks.

Does something like this seem like it'd fix your issue? I think there are a few other routes we could go down here so let us know.

helbashandy commented 4 years ago

Thank you @amoeba for the suggestions!

Yes, that pretty much solves the issue if we had a custom field. However, we haven't tried to update our schema.xml file before, so this would require a little bit of customization in how we build our Metacat image. We'll probably go with this path forward.

However, just to double-check: even when using the newer version of Solr (after the separation of Solr out of Metacat), there wouldn't be a capability to have the field indexed and also non-unique?

amoeba commented 4 years ago

However, just to double-check: even when using the newer version of Solr (after the separation of Solr out of Metacat), there wouldn't be a capability to have the field indexed and also non-unique?

That's what I understand and a bit of testing has backed this up.

I should have mentioned this, but I suspect you can get both things at once (the ability to search and the ability to get all values, not just uniques): keep the field(s) in question set up as they are currently (i.e., indexed), add a copyField directive that copies each one to a non-indexed version of the field, and update your CitationView to be built off the non-indexed copy. We have examples of copyField directives in the schema.xml we ship with Metacat. I haven't tested it, but it seems like another route to go down if you really don't want to break searching within a field like authorLastName.
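
An untested sketch of that copyField setup, for illustration only. The destination field name `authorLastNameAll` and the fieldType name `string_unindexed` are hypothetical, and the attributes on the existing authorLastName field are assumed rather than copied from the stock schema:

```xml
<!-- Hypothetical non-indexed type for the returnable copy. -->
<fieldType name="string_unindexed" class="solr.StrField" indexed="false" stored="true"/>

<!-- Existing field stays indexed so searching on it keeps working (attributes assumed). -->
<field name="authorLastName" type="string" indexed="true" stored="true" multiValued="true"/>

<!-- Hypothetical stored-only copy that the citation view would read instead. -->
<field name="authorLastNameAll" type="string_unindexed" indexed="false" stored="true" multiValued="true"/>

<!-- Copy the incoming values of the indexed field into the stored-only field. -->
<copyField source="authorLastName" dest="authorLastNameAll"/>
```

Whether the copy actually retains duplicate names would need to be verified against a reindexed test dataset before relying on it for citations.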

helbashandy commented 4 years ago

Thank you for the suggestion @amoeba! Does that mean that the copyField directive will copy the non-unique values from a unique indexed field to the custom non-indexed version of the field?

amoeba commented 4 years ago

That's my guess. From the docs:

Fields are copied before analysis is done, meaning you can have two fields with identical original content, but which use different analysis chains and are stored in the index differently.

I haven't verified this but mention it in case that setup might be better than another.

mbjones commented 3 years ago

@taojing2002 @amoeba @csjx what is our plan for addressing this?

amoeba commented 3 years ago

Hey @mbjones, thanks for checking up on this. I think we should touch base with @helbashandy to see if they've had a chance to try my suggestions. I can do that via Slack here shortly. I'm not sure if either of my suggestions are changes we'd want to make to the stock Solr schema we ship with Metacat but we should probably at least talk about it. I'll toss it on the agenda for this week's developer call.

robyngit commented 2 years ago

FYI @taojing2002, this was identified as a high priority issue in MetacatUI, see https://github.com/NCEAS/metacatui/issues/1044