AtlasOfLivingAustralia / galah-R

Query living atlases from R
https://galah.ala.org.au

Add filters for `fileSize` and/or `imageSize` to `atlas_media`? #140

Closed daxkellie closed 1 year ago

daxkellie commented 2 years ago

It was suggested in a separate issue to add a fileSize filter for downloading media. This is potentially a good idea worth discussing, and I don't imagine it would be too difficult to implement for atlas_media. The original comment was:

Not to dump more requests into the same Issue, but I think a fileSize filter or an imageWidth/Height filter would be fantastic for uses like mine.

mjwestgate commented 2 years ago

This is an interesting one, as it may require a rethink of how atlas_media works. The current workflow for atlas_media is as follows:

atlas_media is unusual, therefore, in that it returns images as a side-effect; what is actually returned to the workspace is a tibble (as in atlas_occurrences and elsewhere), not the images themselves. Further, the current function design ensures that the resulting tibble can only be filtered once the images have already been downloaded, which is inefficient (and prevents filtering by licence type as per issue #151). A final problem is that if you have a set of media IDs, there is no way to bypass the atlas_occurrences step within atlas_media and just download the files you want.

Some useful additions would be:

A harder problem is how atlas_media should behave.

Option 1 is to force the user to do their own occurrence download first, and pass the results to atlas_media to get image metadata, i.e.

galah_call() |>
   galah_filter(year == 2022) |>
   galah_select(group = c("basic", "media")) |>
   atlas_occurrences() |>
   atlas_media() |>
   dplyr::filter(sizeInBytes < 10^6) |> # optional extra filtering stage
   get_media()

This is more modular than the current version and doesn't require many new function names, but it greatly changes the behaviour of atlas_media by forcing an intermediate call to galah_select and atlas_occurrences.

Option 2 is to keep the filtering behaviour of atlas_media unchanged, but create a new function for users who want a more modular workflow, e.g.

# basic usage
galah_call() |>
  galah_filter(year == 2022) |>
  atlas_media() |>            # returns a tibble only, but doesn't require atlas_occurrences first
  dplyr::filter(sizeInBytes < 10^6, licence == "http://creativecommons.org/licenses/by-sa/4.0/") |>
  get_media(cache = "my_cache", type = "thumbnail") # downloads images

# advanced usage
galah_call() |>
   galah_filter(year == 2022) |>
   galah_select(group = c("basic", "media")) |>
   atlas_occurrences() |>
   show_all_media() |>  # get metadata on images
   dplyr::filter(sizeInBytes < 10^6) |>
   get_media()

mjwestgate commented 2 years ago

Current status of a simple example:

x <- galah_call() |> 
  galah_filter(year == 2010) |> 
  galah_identify("Litoria peronii") |>  
  atlas_media() |> 
  collect_media(download_dir = "TEST")
28 files were downloaded to path/to/TEST

A more complex example showing how to filter on image metadata before downloading (here by image width; file size works the same way via size_in_bytes):

galah_call() |> 
  galah_filter(year == 2010) |> 
  galah_identify("Litoria peronii") |>  
  atlas_media() |>  
  dplyr::filter(width > 1000) |> 
  collect_media(download_dir = "TEST", type = "thumbnail")

# A tibble: 16 × 13
   media_id                             mime_type  size_in_bytes date_uploaded       date_taken          height width creator        license       data_…¹ occur…² url   downl…³
   <chr>                                <chr>              <int> <chr>               <chr>                <int> <int> <chr>          <chr>         <chr>   <chr>   <chr> <chr>  
 1 218ef2ec-8bb7-4664-922c-de39031e0d86 image/jpeg        158380 2015-11-07 04:08:14 2015-11-07 04:08:14    830  1024 Robert Bender  ""            dr893   "e5aa8… http… /Users…
 2 f90e4dff-2ecf-4468-a6af-db9e61e2e300 image/jpeg        158380 2015-10-17 04:07:58 2015-10-17 04:07:58    830  1024 Robert Bender  ""            dr893   "d8804… http… /Users…
 3 00e958ef-af1c-41c4-b5a5-c7b1bf7ccdf0 image/jpeg        689628 2019-09-05 23:16:51 2019-09-05 23:16:51   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
 4 090f0525-3c2c-422e-b8f3-7c13a3b9319d image/jpeg        756318 2019-09-05 23:16:52 2019-09-05 23:16:52   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
 5 3f389785-7ae0-4de6-b324-b99a56910fc9 image/jpeg        910591 2019-09-05 23:16:51 2019-09-05 23:16:51   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
 6 734f4c29-1f2c-4cee-adb2-6f3468c00faf image/jpeg        700821 2019-09-05 23:16:52 2019-09-05 23:16:52   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
# … with 10 more rows, and abbreviated variable names ¹​data_resource_uid, ²​occurrence_id, ³​download_path
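
The call above filters on `width`, but the same tibble also carries `size_in_bytes`, so a file-size filter follows the same pattern. Below is a minimal offline sketch, assuming a hand-built stand-in for the metadata tibble: the values are copied from the six rows printed above, while the object name `media` and the 700 kB threshold are illustrative choices, not part of galah.

```r
library(tibble)
library(dplyr)

# Hand-built stand-in for the metadata tibble; values copied from the
# six rows printed above (only the columns needed for filtering).
media <- tibble(
  media_id = c(
    "218ef2ec-8bb7-4664-922c-de39031e0d86",
    "f90e4dff-2ecf-4468-a6af-db9e61e2e300",
    "00e958ef-af1c-41c4-b5a5-c7b1bf7ccdf0",
    "090f0525-3c2c-422e-b8f3-7c13a3b9319d",
    "3f389785-7ae0-4de6-b324-b99a56910fc9",
    "734f4c29-1f2c-4cee-adb2-6f3468c00faf"
  ),
  size_in_bytes = c(158380L, 158380L, 689628L, 756318L, 910591L, 700821L),
  height = c(830L, 830L, 1365L, 1365L, 1365L, 1365L),
  width  = c(1024L, 1024L, 2048L, 2048L, 2048L, 2048L)
)

# Keep wide images that are also under an (arbitrary) 700 kB limit
small_wide <- media |>
  filter(width > 1000, size_in_bytes < 700000)

nrow(small_wide)
```

Because the filter runs on metadata only, nothing is downloaded until the filtered tibble is passed on to the download step.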

Finally, atlas_media can be avoided completely by first running a custom atlas_occurrences call:

df <- galah_call() |> 
  galah_filter(year == 2010, images != "") |> 
  galah_identify("Litoria peronii") |>  
  galah_select(scientificName, eventDate, images) |>
  atlas_occurrences() 

> df
# A tibble: 19 × 3
   scientificName  eventDate            images                                                                                                                                  
   <chr>           <chr>                <chr>                                                                                                                                   
 1 Litoria peronii 2010-01-12T17:07:50Z fe81d289-fc80-4f30-a6cd-b145200ba423                                                                                                    
 2 Litoria peronii 2010-11-06T22:55:34Z c6c5ed86-d83c-4e0e-9fea-187700d0a328                                                                                                    
 3 Litoria peronii 2010-01-21T13:00:00Z df220f9c-ba53-4ee2-adc6-d5bc2919c317                                                                                                    
 4 Litoria peronii 2010-01-03T10:55:28Z cbc1873c-56b4-4ff1-83bc-226164d1d079                                                                                                    
 5 Litoria peronii 2010-11-07T09:55:34Z ee4b84e1-95bb-4ff1-bfaf-ec251754f0b5                                                                                                    
 6 Litoria peronii 2010-12-21T13:00:00Z 218ef2ec-8bb7-4664-922c-de39031e0d86                                                                                                    
 7 Litoria peronii 2010-12-09T13:00:00Z 44e386a3-842e-4191-a703-0d2df8942000                                                                                                    
 8 Litoria peronii 2010-12-21T13:00:00Z f90e4dff-2ecf-4468-a6af-db9e61e2e300                                                                                                    
 9 Litoria peronii 2010-11-07T09:53:16Z 9b813545-cecc-424b-83ba-89fdd1ebdf02                                                                                                    
10 Litoria peronii 2010-12-09T13:00:00Z a8c93ae3-7a53-4af5-9897-47c9c218f1f2 | c59ee1db-c120-4ad2-9027-b6feac5dee5c                                                             
11 Litoria peronii 2010-01-02T23:55:28Z 3f99a0ee-c337-4ede-8498-25a0fa102b92                                                                                                    
12 Litoria peronii 2010-12-25T06:10:00Z 00e958ef-af1c-41c4-b5a5-c7b1bf7ccdf0 | 090f0525-3c2c-422e-b8f3-7c13a3b9319d | 3f389785-7ae0-4de6-b324-b99a56910fc9 | 734f4c29-1f2c-4cee…
13 Litoria peronii 2010-02-23T08:46:49Z 09df8085-2a94-43c7-a9f1-6acfb0d03186                                                                                                    
14 Litoria peronii 2010-11-06T22:53:16Z 332a5582-ec9e-4471-b00d-e31e35bae3c6                                                                                                    
15 Litoria peronii 2010-02-01T01:17:00Z 6873f70f-bba1-4bc3-a3a3-528426f5e319                                                                                                    
16 Litoria peronii 2010-02-23T08:46:49Z 003f0cd0-70bd-4b28-98ef-955eba0470de | 2a6c5fc7-c3ac-4832-a572-cd248357ac47 | 61015de5-662d-437c-b603-51247ed1c063                      
17 Litoria peronii 2010-02-23T08:46:04Z ee82cc94-a1f8-48a1-8457-c941d55f376f                                                                                                    
18 Litoria peronii 2010-02-23T08:46:04Z 900e6776-c9d5-464b-820d-669dccd90ecc                                                                                                    
19 Litoria peronii 2010-02-23T08:46:49Z f8b8f786-a579-4f0c-a13f-2cdce9537c04    
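
Some rows of the `images` column above hold several media IDs separated by ` | `, so one occurrence record can map to more than one file. How galah parses this column internally is not shown in this thread; the following is only a base-R sketch of splitting it by hand, using three values copied from the output above.

```r
# Three values copied from the `images` column printed above
# (one, two and three media IDs respectively).
images <- c(
  "3f99a0ee-c337-4ede-8498-25a0fa102b92",
  "a8c93ae3-7a53-4af5-9897-47c9c218f1f2 | c59ee1db-c120-4ad2-9027-b6feac5dee5c",
  "003f0cd0-70bd-4b28-98ef-955eba0470de | 2a6c5fc7-c3ac-4832-a572-cd248357ac47 | 61015de5-662d-437c-b603-51247ed1c063"
)

# Split on the literal " | " separator; fixed = TRUE stops "|" being
# interpreted as regular-expression alternation.
ids_per_record <- strsplit(images, " | ", fixed = TRUE)

lengths(ids_per_record)              # number of media files per occurrence
media_ids <- unlist(ids_per_record)  # flat character vector of media IDs
length(media_ids)
```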

Then getting associated media:

df |> 
  show_all_media() |>
  dplyr::filter(width > 1000) |> 
  collect_media(download_dir = "TEST", type = "thumbnail")

16 files were downloaded to /Users/wes186/Documents/Work/Development/AtlasOfLivingAustralia/Package_galah/galah/TEST
# A tibble: 16 × 13
   media_id                             mime_type  size_in_bytes date_uploaded       date_taken          height width creator        license       data_…¹ occur…² url   downl…³
   <chr>                                <chr>              <int> <chr>               <chr>                <int> <int> <chr>          <chr>         <chr>   <chr>   <chr> <chr>  
 1 df220f9c-ba53-4ee2-adc6-d5bc2919c317 image/jpeg        271384 2021-06-25 11:55:09 2021-06-25 11:55:09    768  1024 bpalmerau      "http://crea… dr1411  ""      http… /Users…
 2 218ef2ec-8bb7-4664-922c-de39031e0d86 image/jpeg        158380 2015-11-07 04:08:14 2015-11-07 04:08:14    830  1024 Robert Bender  ""            dr893   "e5aa8… http… /Users…
 3 44e386a3-842e-4191-a703-0d2df8942000 image/jpeg        123787 2014-05-20 12:02:41 2014-05-20 12:02:41    778  1024 Ken Walker     ""            dr893   "5f420… http… /Users…
 4 f90e4dff-2ecf-4468-a6af-db9e61e2e300 image/jpeg        158380 2015-10-17 04:07:58 2015-10-17 04:07:58    830  1024 Robert Bender  ""            dr893   "d8804… http… /Users…
 5 a8c93ae3-7a53-4af5-9897-47c9c218f1f2 image/jpeg        103675 2019-09-12 18:27:42 2019-09-12 18:27:42    770  1024 Ken Walker     "http://crea… dr1411  "ac3db… http… /Users…
 6 c59ee1db-c120-4ad2-9027-b6feac5dee5c image/jpeg        121007 2019-07-06 11:08:39 2019-07-06 11:08:39    778  1024 Ken Walker     "http://crea… dr1411  "ac3db… http… /Users…
 7 00e958ef-af1c-41c4-b5a5-c7b1bf7ccdf0 image/jpeg        689628 2019-09-05 23:16:51 2019-09-05 23:16:51   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
 8 090f0525-3c2c-422e-b8f3-7c13a3b9319d image/jpeg        756318 2019-09-05 23:16:52 2019-09-05 23:16:52   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
 9 3f389785-7ae0-4de6-b324-b99a56910fc9 image/jpeg        910591 2019-09-05 23:16:51 2019-09-05 23:16:51   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
10 734f4c29-1f2c-4cee-adb2-6f3468c00faf image/jpeg        700821 2019-09-05 23:16:52 2019-09-05 23:16:52   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
11 9b651b9b-5547-4ae1-bcb9-dab15ac86c29 image/jpeg        702069 2015-10-21 12:08:57 2015-10-21 12:08:57   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
12 ee84d419-80a5-4386-8cfe-02a88a53238b image/jpeg        659726 2019-09-05 23:16:50 2019-09-05 23:16:50   1365  2048 Niko Pax       "http://crea… dr1411  "d9f04… http… /Users…
13 6873f70f-bba1-4bc3-a3a3-528426f5e319 image/jpeg        587538 2020-06-08 03:12:28 2020-06-08 03:12:28   1207  1811 davidcoleby    "http://crea… dr1411  "150bf… http… /Users…
14 003f0cd0-70bd-4b28-98ef-955eba0470de image/jpeg        833794 2020-06-08 03:11:00 2020-06-08 03:11:00   1366  2048 Arthur Chapman "http://crea… dr1411  "9f633… http… /Users…
15 2a6c5fc7-c3ac-4832-a572-cd248357ac47 image/jpeg        890271 2020-06-08 03:10:50 2020-06-08 03:10:50   1366  2048 Arthur Chapman "http://crea… dr1411  "9f633… http… /Users…
16 61015de5-662d-437c-b603-51247ed1c063 image/jpeg       1154857 2020-06-08 03:11:05 2020-06-08 03:11:05   1366  2048 Arthur Chapman "http://crea… dr1411  "9f633… http… /Users…
# … with abbreviated variable names ¹​data_resource_uid, ²​occurrence_id, ³​download_path

It would not be impossible to walk these changes back into atlas_media, but I think that splitting the metadata download from the media download is quite sensible.

mjwestgate commented 1 year ago

This is now possible using dplyr::filter on the tibble returned by atlas_media.
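
A minimal sketch of that pattern, run offline on a hand-made tibble. The rows are invented for illustration; only the column names `size_in_bytes` and `license` come from the output printed earlier in this thread, and the licence URL from the Option 2 example.

```r
library(tibble)
library(dplyr)

# Invented rows standing in for the tibble returned by atlas_media()
media <- tibble(
  media_id      = c("id-1", "id-2", "id-3", "id-4"),
  size_in_bytes = c(250000L, 1500000L, 800000L, 90000L),
  license       = c(
    "http://creativecommons.org/licenses/by-sa/4.0/",
    "http://creativecommons.org/licenses/by-sa/4.0/",
    "",
    "http://creativecommons.org/licenses/by-sa/4.0/"
  )
)

# Keep files under 1 MB that carry the wanted licence
keep <- media |>
  filter(
    size_in_bytes < 10^6,
    license == "http://creativecommons.org/licenses/by-sa/4.0/"
  )

keep$media_id            # IDs that would go on to the download step
sum(keep$size_in_bytes)  # total bytes that would be downloaded
```

In a real session the filtered tibble would then be piped to collect_media(), as in the examples earlier in the thread.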