Closed nickdos closed 1 year ago
Adding a second example from another user.
Should show 4155 records with DQ filters on ALA general.
Click download, select the "customised download" format, use the "select all" button at the top of the page, then click next.
Resulting download should contain 4155 records but only has 963 data rows. ~Also the ZIP file is missing the extra files: README.html, citations.csv, headings.csv and DOI.txt.~
If you only select half of the options on the "customised" page, then the download works as expected. This suggests we are hitting a limit for a URL or param string in the download service.
The reduced record count in the CSV appears to be caused by a failure in the streaming query. The exception message is "null object",
originating at https://github.com/AtlasOfLivingAustralia/biocache-service/blob/b92b19049ebb8caa86b9bf738b9ebc681c54ba3e/src/main/java/au/org/ala/biocache/dao/SolrIndexDAOImpl.java#L1074
The error logged:
2021-10-07 14:38:52,639 [biocache-query-offline-7] ERROR au.org.ala.biocache.dao.SolrIndexDAOImpl (SolrIndexDAOImpl.java:389) - Exception - query failed - SOLRQuery: q=lft:[579944+TO+579944]&fq=-outlierLayerCount:[3+TO+*]&fq=-coordinateUncertaintyInMeters:[10001+TO+*]&fq=-year:[*+TO+1700]&fq=-establish
mentMeans:"MANAGED"+AND+-decimalLatitude:0+AND+-decimalLongitude:0+AND+-assertions:"PRESUMED_SWAPPED_COORDINATE"+AND+-assertions:"COORDINATES_CENTRE_OF_STATEPROVINCE"+AND+-assertions:"COORDINATES_CENTRE_OF_COUNTRY"+AND+-assertions:"PRESUMED_NEGATED_LATITUDE"+AND+-assertions:"PRESUMED_NEGATED_LONGITUDE"&
fq=-basisOfRecord:"FOSSIL_SPECIMEN"+AND+-(basisOfRecord:"MATERIAL_SAMPLE"+AND+contentTypes:"EnvironmentalDNA")&fq=-userAssertions:50001+AND+-userAssertions:50005&fq=-occurrenceStatus:ABSENT&fq=-spatiallyValid:"false"&fq=-assertions:TAXON_MATCH_NONE+AND+-assertions:INVALID_SCIENTIFIC_NAME+AND+-assertions
:TAXON_HOMONYM+AND+-assertions:UNKNOWN_KINGDOM+AND+-assertions:TAXON_SCOPE_MISMATCH&fq=-duplicateType:"DIFFERENT_DATASET"&rows=-1&start=0&fl=dataResourceUid,imageIDs,raw_recordedBy,modified,language,license,rightsHolder,accessRights,bibliographicCitation,references,institutionID,collectionID,datasetID,i
nstitutionCode,collectionCode,datasetName,ownerInstitutionCode,basisOfRecord,informationWithheld,dataGeneralizations,dynamicProperties,provenance,rights,source,type,occurrenceID,catalogNumber,recordNumber,recordedBy,individualCount,organismQuantity,organismQuantityType,sex,lifeStage,reproductiveConditio
n,behavior,establishmentMeans,occurrenceStatus,preparations,disposition,associatedMedia,associatedReferences,associatedSequences,associatedTaxa,otherCatalogNumbers,occurrenceRemarks,organismID,id,organismName,associatedOccurrences,previousIdentifications,eventID,parentEventID,fieldNumber,eventDate,event
Time,year,month,day,verbatimEventDate,habitat,samplingProtocol,samplingEffort,sampleSizeValue,sampleSizeUnit,fieldNotes,eventRemarks,locationID,higherGeography,continent,waterBody,islandGroup,island,country,countryCode,stateProvince,county,municipality,raw_locality,verbatimLocality,minimumElevationInMet
ers,maximumElevationInMeters,verbatimElevation,verbatimElevation,minimumDepthInMeters,maximumDepthInMeters,verbatimDepth,verbatimDepth,locationAccordingTo,locationRemarks,decimalLatitude,decimalLongitude,geodeticDatum,coordinateUncertaintyInMeters,coordinatePrecision,verbatimCoordinates,raw_decimalLatit
ude,verbatimLatitude,raw_decimalLongitude,verbatimLongitude,raw_geodeticDatum,verbatimCoordinateSystem,verbatimSRS,footprintSRS,georeferencedBy,georeferencedDate,georeferenceProtocol,georeferenceSources,georeferenceVerificationStatus,georeferenceRemarks,identificationID,identificationQualifier,typeStatu
s,identifiedBy,dateIdentified,identificationReferences,identificationVerificationStatus,identificationRemarks,identifierRole,taxonID,scientificNameID,acceptedNameUsageID,taxonConceptID,scientificName,acceptedNameUsage,parentNameUsage,originalNameUsage,nameAccordingTo,namePublishedIn,higherClassification
,kingdom,phylum,class,order,family,genus,subgenus,specificEpithet,infraspecificEpithet,taxonRank,verbatimTaxonRank,scientificNameAuthorship,vernacularName,nomenclaturalCode,taxonomicStatus,nomenclaturalStatus,taxonRemarks,species,measurementDeterminedDate,measurementRemarks,measurementValue,measurementM
ethod,measurementID,measurementType,measurementUnit,measurementDeterminedBy,measurementAccuracy,countryConservation,stateConservation,speciesGroup,speciesSubgroup,el704,el830,el2094,el772,el722,el729,el819,el898,el2099,el848,el891,el790,el765,el711,el797,el718,el887,el894,el2043,el725,el671,el1013,el591
,el715,el841,el1036,el1074,el708,el783,el674,el793,el1078,el1073,el2016,el682,el1079,el789,el827,el1037,el726,el862,el949,el2089,el2101,el743,el786,el2102,el737,el645,el888,el2093,el863,el996,el955,el672,el2126,el1077,el681,el950,el744,el870,el2044,el2095,el798,el948,el791,el890,el2042,el705,el751,el792
,el2091,el876,el843,el666,el816,el810,el1019,el883,el2119,el785,el1056,el836,el889,el1020,el673,el662,el947,el2018,el788,el814,el795,el1080,el680,el2092,el2097,el878,el731,el774,el781,el954,el784,el1011,el867,el749,el951,el2088,el665,el875,el2100,el882,el647,el1002,el720,el892,el713,el799,el766,el2104,e
l1006,el989,el2103,el872,el728,el796,el787,el879,el1038,el881,el746,el730,el1072,el719,el2017,el670,el2098,el865,el794,el609,el957,el2090,el668,el874,el598,el683,el2096,el615,el952,el1081,el860,el767,el893,el742,el886,el806,el676,el747,el820,el782,el663,el866,el833,el707,el899,el1055,cl10930,cl10941,cl2
012,cl1070,cl987,cl110924,cl620,cl10936,cl10947,cl941,cl1061,cl962,cl10903,cl10925,cl613,cl2077,cl2021,cl10847,cl991,cl2049,cl1060,cl901,cl10909,cl2078,cl10937,cl110944,cl605,cl10942,cl2086,cl2087,cl2045,cl1065,cl1066,cl110948,cl23,cl2013,cl10848,cl2081,cl10924,cl12115,cl10946,cl10902,cl927,cl10938,cl21
16,cl110957,cl929,cl10927,cl12079,cl1918,cl619,cl1076,cl2085,cl22,cl990,cl1062,cl936,cl10943,cl2109,cl110923,cl3004,cl2125,cl12081,cl10831,cl10940,cl988,cl510927,cl10945,cl606,cl110925,cl1048,cl1057,cl916,cl10834,cl908,cl10948,cl1085,cl923,cl1053,cl11061,cl10935,cl614,cl910,cl1051,cl963,cl10907,cl930,cl
210927,cl1059,cl611,cl2124,cl618,cl1054,cl1942,cl10933,cl1063,cl3,cl928,cl2084,cl10922,cl1068,cl10900,cl10833,cl964,cl21,cl2110,cl932,cl10906,cl939,cl10939,cl935,cl10928,cl310927,cl10956,cl3003,cl10934,cl110928,cl917,cl2010,cl2079,cl10912,cl942,cl1069,cl2105,cl2052,cl10872,cl410927,cl911,cl1064,cl10921,
cl2048,cl10832,cl2083,cl10829,cl10955,cl914,cl2080,cl2015,cl907,cl110922,cl110927,cl20,cl2050,cl1067,cl938,cl913,cl10835,cl2009,cl2076,cl612,cl2120,cl1058,cl10905,cl678,cl617,cl1071,cl2111,cl110945,cl965,cl906,cl1049,cl961,cl2020,cl10908,cl2022,cl10923,cl959,cl925,cl958,cl1052,cl960,cl2113,cl10926,cl926
,cl12078,cl10910,cl2115,cl12080,cl2117,cl966,cl10944,cl604,cl940,cl10957,cl1050,cl12021,cl1084,cl10929,cl10874,assertions,dataProviderUid,institutionUid,collectionUid,lft,rgt,sensitive : null object
The resulting exception is effectively ignored because we only loop until the download thread reports isDone,
resulting in a partially completed download with nothing reported to the user.
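To illustrate the point about the swallowed exception, here is a minimal sketch, not the actual biocache-service code. It assumes the download runs as a `Future`; `DownloadWaitSketch`, `streamRecords` and `awaitDownload` are hypothetical names. Polling `isDone()` alone only tells you the task has finished, whereas calling `Future.get` rethrows the failure as an `ExecutionException`, so it can be reported to the user.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class DownloadWaitSketch {

    /** Hypothetical stand-in for the streaming query: fails after `failAfter` rows. */
    static int streamRecords(int failAfter) {
        int written = 0;
        for (int i = 0; i < 4155; i++) {
            if (i == failAfter) {
                throw new IllegalStateException("null object");
            }
            written++;
        }
        return written;
    }

    /**
     * Waits for the task to finish. Unlike a bare isDone() loop, a failure inside
     * the task is surfaced to the caller instead of being silently dropped.
     */
    static int awaitDownload(Future<Integer> task) throws Exception {
        while (true) {
            try {
                return task.get(1, TimeUnit.SECONDS);
            } catch (TimeoutException stillRunning) {
                // Task has not finished yet; keep waiting.
            } catch (ExecutionException failed) {
                throw new Exception("Download failed before completion", failed.getCause());
            }
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<Integer> job = () -> streamRecords(963);
            Future<Integer> task = pool.submit(job);
            try {
                System.out.println("Wrote " + awaitDownload(task) + " rows");
            } catch (Exception e) {
                // This is the point where the failure could be reported to the user.
                System.out.println(e.getMessage() + ": " + e.getCause().getMessage());
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

With something along these lines, a failed stream could mark the download as failed rather than delivering a truncated CSV.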
The error streaming the download was related to parsing the JSON text content of dynamicProperties.
There were two issues:
Tuple::getString returns the string "null" for null
values, which fails to parse as JSON
https://github.com/AtlasOfLivingAustralia/biocache-service/blob/b92b19049ebb8caa86b9bf738b9ebc681c54ba3e/src/main/java/au/org/ala/biocache/stream/ProcessDownload.java#L313
There are occurrence records with invalid JSON for dynamicProperties,
see 1a3b6c20-e9bd-44a3-a625-f03be23695e8,
which has doubled escaped quote chars (\"\")
http://nci3-solr-1.ala:8983/solr/biocache/select?fl=dynamicProperties&fq=id%3A%221a3b6c20-e9bd-44a3-a625-f03be23695e8%22&q.op=OR&q=lft%3A%5B579944%20TO%20579944%5D&rows=10&start=0
```
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":461,
"params":{
"q":"lft:[579944 TO 579944]",
"fl":"dynamicProperties",
"start":"0",
"q.op":"OR",
"fq":"id:\"1a3b6c20-e9bd-44a3-a625-f03be23695e8\"",
"rows":"10",
"_":"1633578588360"}},
"response":{"numFound":1,"start":0,"maxScore":1.0,"numFoundExact":true,"docs":[
{
"dynamicProperties":"\"{\"\"determinationfiledas\"\": \"\"Yes\"\", \"\"collectionkind\"\": \"\"Sheet\"\", \"\"gbifissue\"\": [\"\"GEODETIC_DATUM_ASSUMED_WGS84\"\"], \"\"created\"\": 1262709485000, \"\"associatedmediacount\"\": 1, \"\"project\"\": \"\"GPI Georeferencing\"\", \"\"determinationnames\"\": \"\"Banksia integrifolia var. integrifolia L.f.\"\", \"\"subdepartment\"\": \"\"Gen Herb\"\", \"\"gbifid\"\": 1056816636}\""}]
}}
```
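The two dynamicProperties issues above could be guarded against roughly as follows. This is a sketch under assumptions, not the fix that was applied in ProcessDownload: `DynamicPropertiesSketch`, `normalise` and `parseOrEmpty` are hypothetical names, and the JSON parser is passed in as a plain `Function` so the example does not depend on any particular JSON library. It treats a missing value or the literal string "null" as absent, unwraps CSV-style quoting (`"{""key"": ""value""}"`), and falls back to an empty map when parsing still fails, so one bad record does not abort the whole stream.

```java
import java.util.Collections;
import java.util.Map;
import java.util.function.Function;

public class DynamicPropertiesSketch {

    /** Returns JSON text ready for parsing, or null when the field is effectively absent. */
    static String normalise(Object rawFieldValue) {
        // String.valueOf on a null Object yields the literal "null", the same kind of
        // value described above for Tuple::getString.
        String text = String.valueOf(rawFieldValue).trim();
        if (rawFieldValue == null || text.isEmpty() || "null".equals(text)) {
            return null;
        }
        // CSV-style quoting: the whole object is wrapped in quotes and every
        // inner quote is doubled, as in the record shown above.
        if (text.startsWith("\"{") && text.endsWith("}\"")) {
            text = text.substring(1, text.length() - 1).replace("\"\"", "\"");
        }
        return text;
    }

    /** Parses the field with the supplied parser; absent or invalid content gives an empty map. */
    static Map<String, Object> parseOrEmpty(Object rawFieldValue,
                                            Function<String, Map<String, Object>> jsonParser) {
        String json = normalise(rawFieldValue);
        if (json == null) {
            return Collections.emptyMap();
        }
        try {
            return jsonParser.apply(json);
        } catch (RuntimeException invalidJson) {
            // Assumption: the supplied parser signals bad input with an unchecked exception.
            return Collections.emptyMap();
        }
    }

    public static void main(String[] args) {
        String fromSolr = "\"{\"\"collectionkind\"\": \"\"Sheet\"\", \"\"gbifid\"\": 1056816636}\"";
        System.out.println(normalise(fromSolr));
        System.out.println(normalise(null));
    }
}
```

Whether to repair the doubled quotes at download time or to fix the affected records at ingestion is a separate decision; the sketch only shows the defensive option.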
Reported by a user (ticket 115306); looks similar to #678.
See this download DOI: https://doi.ala.org.au/doi/10.26197/ala.aa4b433a-efa7-4e7f-9eba-aeb16476a24c
The total record count is 1,315,567, but the download contains only 441 records, and the dataset breakdowns also show values of only 432 and 8.
I'm wondering if this filter is the problem:
filter: -basisOfRecord:"FOSSIL_SPECIMEN" AND -(basisOfRecord:"MATERIAL_SAMPLE" AND contentTypes:"EnvironmentalDNA")