AtlasOfLivingAustralia / biocache-service

Occurrence & mapping webservices
https://biocache-ws.ala.org.au/ws/

Download file and breakdown counts truncated #692

Closed nickdos closed 1 year ago

nickdos commented 3 years ago

Reported by a user (ticket 115306); looks similar to #678.

See this download DOI: https://doi.ala.org.au/doi/10.26197/ala.aa4b433a-efa7-4e7f-9eba-aeb16476a24c

The total record count is 1,315,567, but the download contains only 441 records, and the dataset breakdowns likewise show values of only 432 and 8.

I'm wondering if this filter is the problem:

nickdos commented 3 years ago

Adding a second example from another user.

https://biocache.ala.org.au/occurrences/search?q=lsid%3Ahttps%3A%2F%2Fid.biodiversity.org.au%2Fnode%2Fapni%2F2891324&qualityProfile=ALA

Should show 4155 records with DQ filters on ALA general.

Click download, select "customised download" format, use the "select all" button top of page, click next.

Resulting download should contain 4155 records but only has 963 data rows. ~Also the ZIP file is missing the extra files: README.html, citations.csv, headings.csv and DOI.txt.~

If you only select half of the options on the "customised" page, then the download works as expected. This suggests we are hitting a limit for a URL or param string in the download service.

brucehyslop commented 2 years ago

The reduced record count in the CSV appears to be caused by a failure in the streaming query. The exception message is "null object", originating at https://github.com/AtlasOfLivingAustralia/biocache-service/blob/b92b19049ebb8caa86b9bf738b9ebc681c54ba3e/src/main/java/au/org/ala/biocache/dao/SolrIndexDAOImpl.java#L1074

The error logged:

2021-10-07 14:38:52,639 [biocache-query-offline-7] ERROR au.org.ala.biocache.dao.SolrIndexDAOImpl  (SolrIndexDAOImpl.java:389) - Exception - query failed - SOLRQuery: q=lft:[579944+TO+579944]&fq=-outlierLayerCount:[3+TO+*]&fq=-coordinateUncertaintyInMeters:[10001+TO+*]&fq=-year:[*+TO+1700]&fq=-establish
mentMeans:"MANAGED"+AND+-decimalLatitude:0+AND+-decimalLongitude:0+AND+-assertions:"PRESUMED_SWAPPED_COORDINATE"+AND+-assertions:"COORDINATES_CENTRE_OF_STATEPROVINCE"+AND+-assertions:"COORDINATES_CENTRE_OF_COUNTRY"+AND+-assertions:"PRESUMED_NEGATED_LATITUDE"+AND+-assertions:"PRESUMED_NEGATED_LONGITUDE"&
fq=-basisOfRecord:"FOSSIL_SPECIMEN"+AND+-(basisOfRecord:"MATERIAL_SAMPLE"+AND+contentTypes:"EnvironmentalDNA")&fq=-userAssertions:50001+AND+-userAssertions:50005&fq=-occurrenceStatus:ABSENT&fq=-spatiallyValid:"false"&fq=-assertions:TAXON_MATCH_NONE+AND+-assertions:INVALID_SCIENTIFIC_NAME+AND+-assertions
:TAXON_HOMONYM+AND+-assertions:UNKNOWN_KINGDOM+AND+-assertions:TAXON_SCOPE_MISMATCH&fq=-duplicateType:"DIFFERENT_DATASET"&rows=-1&start=0&fl=dataResourceUid,imageIDs,raw_recordedBy,modified,language,license,rightsHolder,accessRights,bibliographicCitation,references,institutionID,collectionID,datasetID,i
nstitutionCode,collectionCode,datasetName,ownerInstitutionCode,basisOfRecord,informationWithheld,dataGeneralizations,dynamicProperties,provenance,rights,source,type,occurrenceID,catalogNumber,recordNumber,recordedBy,individualCount,organismQuantity,organismQuantityType,sex,lifeStage,reproductiveConditio
n,behavior,establishmentMeans,occurrenceStatus,preparations,disposition,associatedMedia,associatedReferences,associatedSequences,associatedTaxa,otherCatalogNumbers,occurrenceRemarks,organismID,id,organismName,associatedOccurrences,previousIdentifications,eventID,parentEventID,fieldNumber,eventDate,event
Time,year,month,day,verbatimEventDate,habitat,samplingProtocol,samplingEffort,sampleSizeValue,sampleSizeUnit,fieldNotes,eventRemarks,locationID,higherGeography,continent,waterBody,islandGroup,island,country,countryCode,stateProvince,county,municipality,raw_locality,verbatimLocality,minimumElevationInMet
ers,maximumElevationInMeters,verbatimElevation,verbatimElevation,minimumDepthInMeters,maximumDepthInMeters,verbatimDepth,verbatimDepth,locationAccordingTo,locationRemarks,decimalLatitude,decimalLongitude,geodeticDatum,coordinateUncertaintyInMeters,coordinatePrecision,verbatimCoordinates,raw_decimalLatit
ude,verbatimLatitude,raw_decimalLongitude,verbatimLongitude,raw_geodeticDatum,verbatimCoordinateSystem,verbatimSRS,footprintSRS,georeferencedBy,georeferencedDate,georeferenceProtocol,georeferenceSources,georeferenceVerificationStatus,georeferenceRemarks,identificationID,identificationQualifier,typeStatu
s,identifiedBy,dateIdentified,identificationReferences,identificationVerificationStatus,identificationRemarks,identifierRole,taxonID,scientificNameID,acceptedNameUsageID,taxonConceptID,scientificName,acceptedNameUsage,parentNameUsage,originalNameUsage,nameAccordingTo,namePublishedIn,higherClassification
,kingdom,phylum,class,order,family,genus,subgenus,specificEpithet,infraspecificEpithet,taxonRank,verbatimTaxonRank,scientificNameAuthorship,vernacularName,nomenclaturalCode,taxonomicStatus,nomenclaturalStatus,taxonRemarks,species,measurementDeterminedDate,measurementRemarks,measurementValue,measurementM
ethod,measurementID,measurementType,measurementUnit,measurementDeterminedBy,measurementAccuracy,countryConservation,stateConservation,speciesGroup,speciesSubgroup,el704,el830,el2094,el772,el722,el729,el819,el898,el2099,el848,el891,el790,el765,el711,el797,el718,el887,el894,el2043,el725,el671,el1013,el591
,el715,el841,el1036,el1074,el708,el783,el674,el793,el1078,el1073,el2016,el682,el1079,el789,el827,el1037,el726,el862,el949,el2089,el2101,el743,el786,el2102,el737,el645,el888,el2093,el863,el996,el955,el672,el2126,el1077,el681,el950,el744,el870,el2044,el2095,el798,el948,el791,el890,el2042,el705,el751,el792
,el2091,el876,el843,el666,el816,el810,el1019,el883,el2119,el785,el1056,el836,el889,el1020,el673,el662,el947,el2018,el788,el814,el795,el1080,el680,el2092,el2097,el878,el731,el774,el781,el954,el784,el1011,el867,el749,el951,el2088,el665,el875,el2100,el882,el647,el1002,el720,el892,el713,el799,el766,el2104,e
l1006,el989,el2103,el872,el728,el796,el787,el879,el1038,el881,el746,el730,el1072,el719,el2017,el670,el2098,el865,el794,el609,el957,el2090,el668,el874,el598,el683,el2096,el615,el952,el1081,el860,el767,el893,el742,el886,el806,el676,el747,el820,el782,el663,el866,el833,el707,el899,el1055,cl10930,cl10941,cl2
012,cl1070,cl987,cl110924,cl620,cl10936,cl10947,cl941,cl1061,cl962,cl10903,cl10925,cl613,cl2077,cl2021,cl10847,cl991,cl2049,cl1060,cl901,cl10909,cl2078,cl10937,cl110944,cl605,cl10942,cl2086,cl2087,cl2045,cl1065,cl1066,cl110948,cl23,cl2013,cl10848,cl2081,cl10924,cl12115,cl10946,cl10902,cl927,cl10938,cl21
16,cl110957,cl929,cl10927,cl12079,cl1918,cl619,cl1076,cl2085,cl22,cl990,cl1062,cl936,cl10943,cl2109,cl110923,cl3004,cl2125,cl12081,cl10831,cl10940,cl988,cl510927,cl10945,cl606,cl110925,cl1048,cl1057,cl916,cl10834,cl908,cl10948,cl1085,cl923,cl1053,cl11061,cl10935,cl614,cl910,cl1051,cl963,cl10907,cl930,cl
210927,cl1059,cl611,cl2124,cl618,cl1054,cl1942,cl10933,cl1063,cl3,cl928,cl2084,cl10922,cl1068,cl10900,cl10833,cl964,cl21,cl2110,cl932,cl10906,cl939,cl10939,cl935,cl10928,cl310927,cl10956,cl3003,cl10934,cl110928,cl917,cl2010,cl2079,cl10912,cl942,cl1069,cl2105,cl2052,cl10872,cl410927,cl911,cl1064,cl10921,
cl2048,cl10832,cl2083,cl10829,cl10955,cl914,cl2080,cl2015,cl907,cl110922,cl110927,cl20,cl2050,cl1067,cl938,cl913,cl10835,cl2009,cl2076,cl612,cl2120,cl1058,cl10905,cl678,cl617,cl1071,cl2111,cl110945,cl965,cl906,cl1049,cl961,cl2020,cl10908,cl2022,cl10923,cl959,cl925,cl958,cl1052,cl960,cl2113,cl10926,cl926
,cl12078,cl10910,cl2115,cl12080,cl2117,cl966,cl10944,cl604,cl940,cl10957,cl1050,cl12021,cl1084,cl10929,cl10874,assertions,dataProviderUid,institutionUid,collectionUid,lft,rgt,sensitive : null object

The resulting exception is effectively ignored because we loop until the download thread reports isDone, which results in a partially completed download with nothing reported to the user.

https://github.com/AtlasOfLivingAustralia/biocache-service/blob/b92b19049ebb8caa86b9bf738b9ebc681c54ba3e/src/main/java/au/org/ala/biocache/dao/SearchDAOImpl.java#L753-L773
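
To illustrate why polling isDone hides the failure, here is a minimal standalone sketch. It is not the biocache-service code: the class name `DownloadFailureSketch`, the method `awaitDownload` and the idea of returning a row count are all hypothetical. It relies only on the standard `java.util.concurrent` behaviour that a `Future` reports `isDone() == true` after a failure as well as after success, while `get()` rethrows the failure wrapped in an `ExecutionException`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch, not the actual SearchDAOImpl logic.
final class DownloadFailureSketch {

    /** Runs the task and returns its row count, or rethrows its failure. */
    static int awaitDownload(Callable<Integer> streamingQuery) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // isDone() is true for failed tasks too, so polling it alone cannot
            // tell a complete download from a truncated one. get() blocks until
            // the task ends and rethrows any exception it raised.
            return executor.submit(streamingQuery).get();
        } catch (ExecutionException e) {
            // Surface the original failure instead of silently finishing.
            throw new IllegalStateException("Download incomplete: " + e.getCause(), e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Download interrupted", e);
        } finally {
            executor.shutdown();
        }
    }
}
```

With this shape, a caller such as `awaitDownload(() -> { throw new RuntimeException("null object"); })` gets an exception it can report to the user rather than a short CSV.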

brucehyslop commented 2 years ago

The error streaming the download was related to parsing the dynamicProperties JSON text content.

There were two issues:

  1. Tuple::getString returns the string "null" for null values, which fails to parse as JSON: https://github.com/AtlasOfLivingAustralia/biocache-service/blob/b92b19049ebb8caa86b9bf738b9ebc681c54ba3e/src/main/java/au/org/ala/biocache/stream/ProcessDownload.java#L313

  2. There are occurrence records with invalid JSON for dynamicProperties; see 1a3b6c20-e9bd-44a3-a625-f03be23695e8, which has doubled, escaped quote characters (\"\"):

    
    http://nci3-solr-1.ala:8983/solr/biocache/select?fl=dynamicProperties&fq=id%3A%221a3b6c20-e9bd-44a3-a625-f03be23695e8%22&q.op=OR&q=lft%3A%5B579944%20TO%20579944%5D&rows=10&start=0
    {
    "responseHeader":{
    "zkConnected":true,
    "status":0,
    "QTime":461,
    "params":{
      "q":"lft:[579944 TO 579944]",
      "fl":"dynamicProperties",
      "start":"0",
      "q.op":"OR",
      "fq":"id:\"1a3b6c20-e9bd-44a3-a625-f03be23695e8\"",
      "rows":"10",
      "_":"1633578588360"}},
    "response":{"numFound":1,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[
      {
        "dynamicProperties":"\"{\"\"determinationfiledas\"\": \"\"Yes\"\", \"\"collectionkind\"\": \"\"Sheet\"\", \"\"gbifissue\"\": [\"\"GEODETIC_DATUM_ASSUMED_WGS84\"\"], \"\"created\"\": 1262709485000, \"\"associatedmediacount\"\": 1, \"\"project\"\": \"\"GPI Georeferencing\"\", \"\"determinationnames\"\": \"\"Banksia integrifolia var. integrifolia L.f.\"\", \"\"subdepartment\"\": \"\"Gen Herb\"\", \"\"gbifid\"\": 1056816636}\""}]
    }}
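
Both issues could be guarded against before the value reaches a JSON parser. The sketch below is hypothetical and is not the fix applied in biocache-service: the class `DynamicPropertiesSketch` and the method `normalize` are invented for illustration. It also assumes the bad record is CSV-quoted (wrapped in quotes, with every inner quote doubled), which is what the Solr response above suggests but does not prove.

```java
import java.util.Optional;

// Hypothetical sketch, not the actual ProcessDownload logic.
final class DynamicPropertiesSketch {

    /** Returns a candidate JSON string, or empty when there is nothing to parse. */
    static Optional<String> normalize(String raw) {
        // Issue 1: a null field can arrive as the literal string "null".
        if (raw == null || raw.trim().isEmpty() || "null".equals(raw.trim())) {
            return Optional.empty();
        }
        String value = raw.trim();
        // Issue 2 (assumption): the value is CSV-quoted, i.e. "{""key"": ""v""}".
        // Strip the outer quotes and collapse each doubled quote to a single one.
        if (value.startsWith("\"{") && value.endsWith("}\"")) {
            value = value.substring(1, value.length() - 1).replace("\"\"", "\"");
        }
        return Optional.of(value);
    }
}
```

The caller would still need to catch parse errors for any value that remains invalid after normalizing, so that one bad record cannot abort the whole stream.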