SpeciesFileGroup / taxonworks

Workbench for biodiversity informatics.
http://taxonworks.org

Collection Objects filter task: prolonged issues downloading CSV #2210

Open lvhart2 opened 3 years ago

lvhart2 commented 3 years ago

I want to preface this by saying that I understand most of these issues I will mention are a work in progress and will be updated eventually, but wanted to draw much needed attention to them. I am not an API person so shortcuts that way would be difficult for me.

Quite regularly I need to download data in CSV from CO filter task.

  1. For example, currently I need to download all data on Trichoptera specimens collected from the years 2017-2020. I can do this MOST EFFICIENTLY by adding "Trichoptera" as a Taxon name under Determinations, followed by entering "2020" in the Collecting Event field under Buffered since entering dates under Date Range is not useful. This usually spits out the specimens I need, with a few sprinkled in that do not apply to that year, but have "2020" somewhere in their buffered label, be it coordinates or collecting times. This results in having to delete rows of data that are not needed in the CSV after it is downloaded.

  2. If there are more than 500 records in the results, the CSV will not include more than those first 500 records, despite changing the number of records per page to 1000.

    [Screenshots: Screen Shot 2021-04-26 at 11 14 56 AM, Screen Shot 2021-04-26 at 11 15 34 AM, Screen Shot 2021-04-26 at 11 16 18 AM]

    I can get around this bug by downloading each page of 500 records separately and combining them into one file in Numbers or Excel (see the sketch after this list).

  3. Lastly, if you look at my last screenshot above, you will notice that catalogue numbers in the CSV are omitted for containerized specimens and for specimens that need to be reindexed. This is by far the issue that gives me the most grief! I have to go back and manually enter each catalogue number in the spreadsheet before I send it to my supervisor.
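
For reference, the stitching step in item 2 can also be scripted rather than done by hand in Numbers or Excel. A minimal Ruby sketch, assuming the 500-record downloads were saved as page_1.csv, page_2.csv, and so on (hypothetical filenames) and that every page shares the same header row:

```ruby
require 'csv'

# Hypothetical filenames for the separate 500-record downloads.
pages = Dir.glob('page_*.csv').sort

header = nil
rows = []

pages.each do |path|
  table = CSV.read(path, headers: true)
  header ||= table.headers          # take the header from the first page
  rows.concat(table.map(&:fields))  # collect the data rows from every page
end

# Write a single combined file with one header row.
CSV.open('combined.csv', 'w') do |out|
  out << header
  rows.each { |row| out << row }
end
```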

What could take just a few minutes of data retrieval ends up taking me days to finish due to these issues. I worry that someone who is not familiar with TaxonWorks would become greatly discouraged if they wanted to search for specimens, perhaps to request a loan or just to gather data in general.

Thank you in advance!

lvhart2 commented 3 years ago

Sorry, I forgot to annotate each image. First: shows 1000 records per page. Second: shows the catalogue number of the last record on that page. Third: shows the last records of the downloaded CSV, which do not reflect what is in TW because only the first 500 are included in that list.

mjy commented 3 years ago

@jlpereira the CSV is client-side, right?

@lvhart2 we'll try to prioritize the easy stuff. Some of this is already being improved or dealt with on other branches, particularly the DwC export branch.

Thanks very much for adding this, and don't hesitate to make these requests when issues are systematic for you.

LocoDelAssembly commented 3 years ago

https://github.com/SpeciesFileGroup/taxonworks/blob/e74814dc6d97eb0b32f3d92671ab81a613ae93a2/app/controllers/collection_objects_controller.rb#L357

That must be what prevents the CSV download from being larger than 500 records. Handcrafting the request with a larger `per` param allows downloading much more.

Would it be reasonable to disable pagination at the backend if explicitly requested by the frontend (with some new param)?
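
As an illustration of handcrafting such a request, here is a rough Ruby sketch. The host, path, and filter parameter below are placeholders rather than the actual TaxonWorks routes; only the idea of overriding `per` comes from the comment above:

```ruby
require 'net/http'
require 'uri'

# Placeholder host, path, and filter parameter; the real filter route and
# parameter names may differ. The oversized `per` value is the point.
uri = URI('https://example-taxonworks-instance.org/collection_objects.csv')
uri.query = URI.encode_www_form(
  per: 5000,                  # well above the 500 cap enforced at the controller line linked above
  taxon_name: 'Trichoptera'   # hypothetical filter parameter, for illustration only
)

response = Net::HTTP.get_response(uri)
File.write('collection_objects.csv', response.body) if response.is_a?(Net::HTTPSuccess)
```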

jlpereira commented 3 years ago

> https://github.com/SpeciesFileGroup/taxonworks/blob/e74814dc6d97eb0b32f3d92671ab81a613ae93a2/app/controllers/collection_objects_controller.rb#L357
>
> That must be what prevents the CSV download from being larger than 500 records. Handcrafting the request with a larger `per` param allows downloading much more.
>
> Would it be reasonable to disable pagination at the backend if explicitly requested by the frontend (with some new param)?

Related: https://github.com/SpeciesFileGroup/taxonworks/issues/2121

If I remember correctly, the last time I tried to increase the records per page, the maximum I got was 20,000 records.
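
A minimal sketch of what that "new param" could look like in a Rails controller action, assuming Kaminari-style pagination; the param name `:all`, the helper names, and the 500 cap are illustrative assumptions rather than the actual TaxonWorks code:

```ruby
# Illustrative only, not the real collection_objects_controller.rb contents.
def index
  scope = filtered_collection_objects   # hypothetical helper returning the filtered relation

  # Skip pagination only when the frontend explicitly asks for everything.
  @collection_objects =
    if params[:all].present?
      scope
    else
      scope.page(params[:page]).per(params[:per] || 500)   # Kaminari-style cap, as discussed above
    end

  respond_to do |format|
    format.html
    format.csv do
      # to_csv is a hypothetical serializer for the filtered records.
      send_data to_csv(@collection_objects), filename: 'collection_objects.csv'
    end
  end
end
```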

mjy commented 3 years ago

@lvhart2 Can you provide a little more information about what you're doing with the data? Are there just a couple of columns that you need? What is your boss doing with the data? Things that take days really shouldn't; I suspect we need to address the workflow specifically.

lvhart2 commented 3 years ago

I work for Ed DeWalt. He wants to be able to download data of past collecting years so he can have an idea of what he found and where. He usually has me do all of the TW work for him, but I know we need all of the columns except for:

- collection_objects-id
- dwc_occurrence_id
- dwc_occurrence_object type
- institutional code & id
- nomenclatural code

He has also mentioned that it would be helpful for the individual count to reflect the actual biocuration. Currently I have to enter that data manually if he needs it. It would also help to be able to search collection objects by trip code/accession code and have that as a column in the downloaded CSV.

Feel free to reach out to him directly. dewalt@illinois.edu

Thanks!

mjy commented 2 years ago

@lvhart2 I believe we might be able to close this issue with the new DwC download functionality. Let us know if so.

debpaul commented 2 years ago

@mjy status? Do we need @lvhart2 to test using the DwC download function?

hhopkins77 commented 4 months ago

I brought this issue to the meeting on 2/21/24, with the difference that I was generating search results under Filter Nomenclature and need to be able to download the entirety of the resulting data, no matter how many entries it involves. Currently each download is limited to 2500 results, so I have to download separate pages of 2500 and then combine the downloads into one document.

lvhart2 commented 4 months ago

@hhopkins77 I'm glad you brought this up, because twice this week I have tried to download CO data as a DwC file and it never seems to load. Months ago I filtered data and downloaded it with no problem; it was ready to download in a couple of minutes. What I need is Plecoptera with descendants from the state of Arkansas. The file contains 2721 records. @mjy maybe if you're around tomorrow I will try to show you what issues I am experiencing.

mjy commented 4 months ago

@lvhart2 there is a known bug in the downloader that we should have fixed by tomorrow; sorry for this. Indeed, 2k records should be available in the hub in a minute or two. If they are not, please open bugs here in the future.

hhopkins77 commented 4 months ago

@mjy Is this the issue you were speaking of in the meeting yesterday? It seemed to be the right one, so I added my comment above, but I want to make sure I am on the right issue.