ecoscape-earth / ecoscape-layers

A tool for generating the species-specific input needed to run ecoscape (habitat, terrain and resistance map)
BSD 2-Clause "Simplified" License
0 stars 0 forks source link

Allow Use of Scientific Names & Improvements for Bulk Computation #2

Closed aaditysharma1 closed 2 months ago

aaditysharma1 commented 3 months ago

Hey Jasmine, thank you for the changes. I have implemented most of them. I'll set up an environment and test this thoroughly as I start the next task, which will require me to make further changes.

jasminetai commented 3 months ago

Sounds good, just let me know when you have tested well and documented the changes for review (including in the README).

jasminetai commented 3 months ago

at a glance:

aaditysharma1 commented 3 months ago

You are right, seeing that since range_src is already there as a check it seems redundant to have an extra parameter like ebird_code. It adds to the code without increasing functionality.

aaditysharma1 commented 3 months ago

I'll get back to you once I test it as well.

seansiddens commented 3 months ago

What format is the scientific name expected to be in? Right now I'm using the scientific names from this sheet: https://docs.google.com/spreadsheets/d/1-SNBjZBGjG8dV72t8JPS-YSV9oa_ybeo/edit#gid=2092976143

  1. Just to confirm, these scientific names are compatible, correct?
  2. Should they be passed in some format (like all lowercase) or can the names just be left as is?
jasminetai commented 3 months ago

Yes, I think those names should be compatible. Ideally, my feeling is that capitalization should not matter for ease of use, but this may be something for @aaditysharma1 to comment on / bear in mind while testing.

IanHollow commented 3 months ago

Here are the performance improvements stats for a small and large area:

California Bounds (-124.41060660766607, 32.5342307609976, -114.13445790587905, 42.00965914828148)

Sub-Saharan Africa (-17.6, -35, 52, 27.3)

Overall, it is about 2 times faster. Results on your end might vary based upon your hardware but during testing adding more ram seemed to give the biggest performance increase. @seansiddens would probably appreciate these performance improvements as each time as user on the browser computes an area I would assume that it uses warp.

IanHollow commented 3 months ago

I get an error in the function get_range_from_iucn: RuntimeError: "sci_name" not recognised as an available field.

https://github.com/ecoscape-earth/ecoscape-layers/blob/7eef1bf458508db742d45e8555609c7492c56004/ecoscape_layers/layers.py#L64-L70

I wrote some code to view the available fields:

field_names = [
    input_layer_defn.GetFieldDefn(i).GetName()
    for i in range(input_layer_defn.GetFieldCount())
]
print("Available fields:", field_names)

The output was: Available fields: ['Sequence', 'Order_', 'FamilyName', 'Family', 'Subfamily', 'Tribe', 'CommonName', 'ScientificName', 'Authority', 'F2022_IUCN_RedList_Category', 'Synonyms', 'AlternativeCommonNames', 'TaxonomicSource', 'SISID']

I changed the code to this to fix the error:

  # Open input file and layer, and apply attribute filter using scientific name
  input_src = ogr.Open(input_ranges_gdb, 0)
  input_layer = input_src.GetLayer()
  input_layer_defn = input_layer.GetLayerDefn()
  input_layer.SetAttributeFilter(f"ScientificName = '{species_name}'")
  input_spatial_ref = input_layer.GetSpatialRef()
  input_spatial_ref.MorphToESRI()

However, I now get a new error: AttributeError: 'NoneType' object has no attribute 'MorphToESRI' https://github.com/ecoscape-earth/ecoscape-layers/blob/7eef1bf458508db742d45e8555609c7492c56004/ecoscape_layers/layers.py#L70-L70

@jasminetai do you know why this error could be happening?

IanHollow commented 3 months ago

If we converted IUCN Scientific names into ebird names we could use the ebird range maps which seem to work. However, using IUCN range maps are part of the motivation of this PR therefore the error above needs to be addressed.

jasminetai commented 3 months ago

hmm, that’s strange. It seems that GDAL can’t find a spatial reference associated with the layer.

I am probably not going to have much time to run the code in the next few days, but check: are you passing in the gdb folder correctly? If this error appears even without setting the attribute folder, then something might be off about your input.

In addition, as I debugged something with another person regarding this somewhat recently, can you let me know where you downloaded the IUCN range maps gdb from? It is possible that the download source was not good if there are any “zone identifier” files there (it was a minor mistake by the uploader).

IanHollow commented 3 months ago

hmm, that’s strange. It seems that GDAL can’t find a spatial reference associated with the layer.

I am probably not going to have much time to run the code in the next few days, but check: are you passing in the gdb folder correctly? If this error appears even without setting the attribute folder, then something might be off about your input.

In addition, as I debugged something with another person regarding this somewhat recently, can you let me know where you downloaded the IUCN range maps gdb from? It is possible that the download source was not good if there are any “zone identifier” files there (it was a minor mistake by the uploader).

@jasminetai I downloaded the BOTW.gdb.zip file from the bird and cs google drive. Then I unzipped it and used a path to that folder. I also tried the already unzipped folder on the Google drive but I got the same results. I could try to look into though more but I'm just struggling to understand the issue. I wonder why the attribute name suddenly changed names. Has a package been updated or has the data on the Google drive been updated?

jasminetai commented 3 months ago

In addition, as I debugged something with another person regarding this somewhat recently, can you let me know where you downloaded the IUCN range maps gdb from? It is possible that the download source was not good if there are any “zone identifier” files there (it was a minor mistake by the uploader).

Please check this; I have just checked, and this error you have shared on the attribute name is the same as the one that arose from this issue for another person. There was a data source I uploaded here and a data source that was uploaded for the EcoScape Browser; however, for some time, the latter was bad data, maybe because of those zone identifier files. So I think you should redownload or try to remove those files.

IanHollow commented 3 months ago

I seemed to be missing the file a00000009.gdbtable. I will download this file and try to test again.

IanHollow commented 3 months ago

@jasminetai Thanks for you help everything seems to be fixed now. @aaditysharma1 is going to make a couple changes related removing ebird then everything should be done with this PR.

jasminetai commented 3 months ago

oh nice! the changes look nice and really promising w/ the performance improvement, but will probably review later once Aadity makes those changes then

IanHollow commented 3 months ago

Here are the performance improvements stats for a small and large area:

California Bounds (-124.41060660766607, 32.5342307609976, -114.13445790587905, 42.00965914828148)

* Original    = 15s

* mem & multi = 8s

Sub-Saharan Africa (-17.6, -35, 52, 27.3)

* Original    = 11m

* mem & multi = 6m

Overall, it is about 2 times faster. Results on your end might vary based upon your hardware but during testing adding more ram seemed to give the biggest performance increase. @seansiddens would probably appreciate these performance improvements as each time as user on the browser computes an area I would assume that it uses warp.

These number are slightly wrong because I originally wrote the code incorrectly for the memory performance improvements. The issue came from the gdal python docs saying the that GDAL Warp Options max memory option was in MB however through testing it is actually in bytes. With this change to units being in bytes the performance has drastically increased. Here are the new numbers.

California Bounds (-124.41060660766607, 32.5342307609976, -114.13445790587905, 42.00965914828148)

These results show about a 10x improvement in the Sub-Saharan Africa test. It would good to know that I have a lot of ram on my computer which help achieved this speed. The test had access to around 25GB of memory and it seem to use around 15GB at peek during the warp computation.

aaditysharma1 commented 3 months ago

@jasminetai I added the needed changes to completely remove dependency on ebird. I would really appreciate your feedback. One question I have is, since we are only using IUCN now, is it in our best interest for us to remove the range_src parameter? I think it might be a bit redundant since we only have one source now.

aaditysharma1 commented 3 months ago

I'll make the changes to the documentation once we completely finalize this. Which Python notebook should I update regarding new changes?

jasminetai commented 3 months ago

Oh, to clarify: I think the IUCN layer generation should indeed be independent of eBird (i.e. not need an eBird key), but this does not mean we need to remove the option for generating layers with eBird, which would evidently require eBird input. I know that at least one person is using this package to produce layers with eBird range maps, so it is in our best interest to keep range_src as a parameter and the eBird key parameter (but as an optional input), even if we ourselves may not use it necessarily. And since our EcoScape paper results also used eBird range maps, we do want those results to be reproducible in some way with the package!

You might need to modify both the paper layers notebook and the test layers notebook, depending on what changes you have opted to make — but they should be minor changes, such as editing the input species names. Since I am not sure that either notebook uses IUCN range maps for the range source, you might consider creating and testing a new “test notebook” that shows how you can use IUCN as the range source.

aaditysharma1 commented 3 months ago

I've added a new test Colab Notebook to the drive, with minor changes to the call to LayerGenerator. I will add further changes as I go through the Test Layers Notebook and test this. Here is a link to the notebook:

https://colab.research.google.com/drive/1zVyaveYHnmLvHCtPnuPShs_1PDYO8Ynn

IanHollow commented 3 months ago

I've added a new test Colab Notebook to the drive, with minor changes to the call to LayerGenerator. I will add further changes as I go through the Test Layers Notebook and test this. Here is a link to the notebook:

https://colab.research.google.com/drive/1zVyaveYHnmLvHCtPnuPShs_1PDYO8Ynn

I updated the docs on the colab. however there still needs to be a little bit more done on the colab before it is complete. Also, the repo's README needs to be updated to adhere to the new changes made in this PR.

IanHollow commented 3 months ago

I added a function to the utils that can convert the cords from one crs to another

IanHollow commented 2 months ago

This issue #3 has now been fixed as of recent commits to this PR. For those would like to know more about what the issue was, it is explained in more detail on the issue page #3.

The total computation time of creating the terrain/landcover layer and then generating the habitat layer was improved greatly (about 2x improvement).

For California:
   19s ->     8s
Sub-Saharan Africa:
4m  9s -> 2m 20s

Both of these rounded down to be conservative have a 2x speed up. However, it is important to know that this speedup will only be seen in full effect for birds species with the same terrain/landcover layer. This is great improvement for bulk computations. There are also improvements that are not shown here but are on the issue page #3 for birds with different terrain/landcover layers.

IanHollow commented 2 months ago

I've added a new test Colab Notebook to the drive, with minor changes to the call to LayerGenerator. I will add further changes as I go through the Test Layers Notebook and test this. Here is a link to the notebook: colab.research.google.com/drive/1zVyaveYHnmLvHCtPnuPShs_1PDYO8Ynn

I updated the docs on the colab. however there still needs to be a little bit more done on the colab before it is complete. Also, the repo's README needs to be updated to adhere to the new changes made in this PR.

The Colab notebook has been updated with all changes made in this PR. The only thing left to do is update the README file to reflect the new changes.

Here is a link to the colab notebook

IanHollow commented 2 months ago

@jasminetai I believe this PR is ready to be merged into main. Could you look over it when you have a chance.

IanHollow commented 2 months ago

I like the changes; you did a lot of good work here, and again, the performance improvement is great. Some small comments:

* I changed your example notebook in Colab to download the source landcover tif in a simpler way; check it, and remove the old code if it is good. (I think the code you have there did not work for me due to the name of the zip file containing the "?download=1" query param.)

* The formatting changes are a little bit weird in certain places, such as here, where I would expect the single strings to be on the same line as the parentheses; could this maybe be changed back?
  ![image](https://private-user-images.githubusercontent.com/42330503/341667551-e5c22b0a-b6e2-456b-b764-3159feb1c2db.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MTg5OTI5MjEsIm5iZiI6MTcxODk5MjYyMSwicGF0aCI6Ii80MjMzMDUwMy8zNDE2Njc1NTEtZTVjMjJiMGEtYjZlMi00NTZiLWI3NjQtMzE1OWZlYjFjMmRiLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNDA2MjElMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjQwNjIxVDE3NTcwMVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPWY5NzE4OTY0ZmNiODc5ZmJkOTlmMzk5Mjc3NmUxMzM2NDA4MDg0NDE5NDRiM2MzZjYzMmEzNDg2NmU4YWY0MjMmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0JmFjdG9yX2lkPTAma2V5X2lkPTAmcmVwb19pZD0wIn0.SDkvDOd0fS4hKQtG6dy8riy0OPt9t8fXu4eChVG-2uo)

* To check, you were able to run the notebooks in `/tests` (excluding the old paper layers one) and confirm that they worked as well, right? And that the functionality works locally in general?

Other than these bits, I think this is just about good to merge. (It may take longer for me to get back to you because I am away at my internship, but I will be checking here for responses.)

@jasminetai

What do think the best steps forward are?

jasminetai commented 2 months ago

hmm, ok. That is indeed strange, but thanks for the thorough analysis. I agree with your intuition that we should continue to try to support eBird range maps if possible. What you can check:

jasminetai commented 2 months ago

When you think it is all ready, I can merge this work and push a new version of the package for those on the ecoscape browser effort. But the eBird issue should definitely be looked into, perhaps in a future branch if it is not simply an expired key.

seansiddens commented 2 months ago

If this can get merged and updated to a new version within the next couple days that would be really useful for us.

IanHollow commented 2 months ago

@jasminetai Good news the ebird key that was in the Google driver was expired. All the test cases in this repo passed after using a new key from ebird. Just keep note it took 7 minutes to run the paper layers test. I don't know how long it took before but hopefully it is faster now based on all of the improvements.

@seansiddens I will merge now because all issue have been resolved and all features of the PR have been completed. Hopefully these changes will speed up the ecoscape-browser computation as well.

jasminetai commented 2 months ago

That's great! If you have not done it already, @IanHollow, could you update the config.py in the shared drive (I believe in the Colab folder) so that the new key can be referenced by everyone?

I made very minor changes to the docs and have just now updated the package on PyPi.

IanHollow commented 2 months ago

I updated the ebird api key in the Colab folder. I also put the expiration date so that we know when it needs to be updated again. Thanks for your help Jasmine.