Deliverable queries - so far

brahmwg / Bottlenecks_MDS_Capstone

Master of Data Science Capstone Project for Bottlenecks to Survival

0 stars 0 forks source link

Deliverable queries - so far #8

Closed AReyH closed 4 months ago

AReyH commented 6 months ago

This is for us to share the status of the queries we have so far.

AReyH commented 6 months ago

For each stage (i.e. hatchery/river, estuary, microtroll, adult), what are the number of tags deployed to date or over a specific time frame?

SELECT stage,
      COUNT(DISTINCT(tag_id_long))
FROM hatch_tag 
WHERE stage IS NOT NULL
AND date_time_release BETWEEN '2018-01-01' -- Change this for initial date
                          AND '2024-01-01' -- Change this for after date
GROUP BY stage

Edited query based on @brahmwg 's feedback

SELECT CASE 
            WHEN h.stage = 'NA' THEN 'Hatchery'
            ELSE h.stage
            END,
         'hatchery' as tag_status,
         COUNT(DISTINCT(h.tag_id_long)) AS tag_count
FROM hatch_tag h
WHERE h.tag_id_long IS NOT NULL
AND date_time_release > '2021-01-01'
GROUP BY h.stage

UNION ALL

SELECT 'Field' as stage,
         COALESCE(CASE 
                    WHEN f.tag_status = 'no tag' THEN 'notag'
                    ELSE f.tag_status END, 'NA') AS tag_status,
         COUNT(DISTINCT(f.tag_id_long)) AS tag_count

FROM field f
WHERE f.tag_id_long IS NOT NULL
AND f.tag_status NOT IN ('recap', 'mort', 'reject')
AND date > '2021-01-01'
GROUP BY f.tag_status

UNION ALL

SELECT 'Microtroll' as stage,
         COALESCE(m.tag_status,'NA'),
         COUNT(DISTINCT(m.tag_id_long)) AS tag_count

FROM microtroll m
WHERE m.tag_id_long IS NOT NULL
AND m.tag_status NOT IN ('recap', 'mort', 'reject')
AND date > '2021-01-01'
GROUP BY m.tag_status

For the microtroll table there is a 0 value for tag_status. How should we handle this value?

brahmwg commented 6 months ago

This query may also need to consider the number of tags deployed in field and in microtroll. PIT tagging fish can occur at hatcheries, or in the wild during field work (at beaches, etc) or microtrolling. Be aware of the tag_status column, which may indicate whether a particular fish was tagged, recaptured a previously tagged fish, etc.

rafecchang commented 6 months ago

this is what I have so far for: For each system/watershed, list all tags deployed that were subsequently redetected on the lowest mainstem array or in the estuary immediately after outmigration (to look at freshwater survival rather than overall survival and adult returns).

I have yet included "estuary immediately after outmigration"
to filter through date to ensure is not return, what should the window be? Can I use the same approach to determine fresh water instead of overall survival?

SELECT DETECTIONS.* 
FROM DETECTIONS
LEFT JOIN LOCATION ON DETECTIONS.location = LOCATION.location_code
LEFT JOIN HATCH_TAG ON DETECTIONS.tagid = HATCH_TAG.tag_id_long  
WHERE LOCATION.site_description LIKE 'Mainstem Array'
AND HATCH_TAG.system = LOCATION.watershed 
AND LOCATION.subloc LIKE 'ds'

jlee2843 commented 6 months ago

For each species in each system, list the numbers of hatchery and wild tags going out and the timing of those detections (to be able to compare hatchery and wild outmigrations).

Uncertainties:
- How to count wild tags going out? (For now, I used number_release within the Hatch_Param table.)
- How to count hatcheries? (For now, I used number_tagged within the Hatch_Param table.)
- What should be the time frame of time of detection?

SELECT hatch_param.species,
  hatch_param.system,
  hatch_param.tagging_date,
  SUM(hatch_param.number_tagged) AS hatchery_tagged,
  SUM(hatch_param.number_released) AS outmigration
FROM hatch_param
GROUP BY
  hatch_param.species,
  hatch_param.system,
  hatch_param.tagging_date;

Modified as below after @brahmwg's comment:

SELECT hatch_tag.species,
  hatch_tag.system,
  STRING_AGG(field.date::TEXT, ', ') AS field_detection,
  STRING_AGG(microtroll.date::TEXT, ', ') AS microtroll_detection,
  COUNT(DISTINCT hatch_tag.tag_id_long) AS hatchery_count,
  COUNT(DISTINCT field.tag_id_long) AS field_count,
  COUNT(DISTINCT microtroll.tag_id_long) AS microtroll_count
FROM hatch_tag 
LEFT JOIN field ON hatch_tag.tag_id_long = field.tag_id_long
LEFT JOIN microtroll ON hatch_tag.tag_id_long = microtroll.tag_id_long
GROUP BY hatch_tag.species,
  hatch_tag.system;

brahmwg commented 6 months ago

For each species in each system, list the numbers of hatchery and wild tags going out and the timing of those detections (to be able to compare hatchery and wild outmigrations).

Uncertainties:

How to count wild tags going out? (For now, I used number_release within the Hatch_Param table.)

How to count hatcheries? (For now, I used number_tagged within the Hatch_Param table.)

What should be the time frame of time of detection?

Wild tags should include tagging events in field and microtroll. You could imagine that tagging events take place at hatcheries (hatch_tag), in the marine environment (microtroll) and in between (in-river, beach/estuarine, etc. in field)

Fish tagged at hatcheries are 100% hatchery-origin. Fish tagged in rivers, beach seines, or microtrolling can be hatchery or wild origin. Some, but not all hatchery fish are clipped so you can tell by looking at them (adipose fin removed) if they are a hatchery fish or not. So if a fish is caught in river, seine, or microtroll and has clip_status = clip, then it is hatchery-origin. However, you could still have a fish with clip-stats = unclip that is also of hatchery-origin, it just wasn't clipped at the hatchery. This is where we need to link to our genetics results. If that individual was identified through PBT analysis, then it was a hatchery-origin fish. PBT or Parentage-Based-Tagging is something only done in hatcheries where they take the genetic information of the parent fish when pairing them to fertilize the females eggs. If the individual was identified to stock using GSI (Genetic Stock Information), then it was likely a wild fish.

Just to clarify; So in hatch_tag and hatch_param, system = hatchery name, but in the field and microtroll tables, system = river system that the fish is from (and could be the hatchery name if it turns out to be a hatchery fish or could be a wild fish)

Also, be aware of the tag_status column, which indicates if a fish that was caught was tagged, recaptured a previously tagged fish, mort (died after tagging) or rejected the tag. Not every row in field or microtroll is necessarily a tagging event, it could be a recap or other.

brahmwg commented 6 months ago

For each system/watershed, list all tags deployed that were subsequently redetected on the lowest mainstem array or in the estuary immediately after outmigration (to look at freshwater survival rather than overall survival and adult returns).

I have yet included "estuary immediately after outmigration" to filter through date to ensure is not return, what should the window be? Can I use the same approach to determine fresh water instead of overall survival?

Lowest mainstem array can be found by looking at the lowest number location.location_code for a particular watershed. Also helpful to look at the site_description. I have also quickly made a map displaying the locations of the antennas, cleaning tables, etc. The map can be seen here: https://marinescience.info/superset/explore/p/JNb87V0mMK1/ Note the legend: har = harvest, pre = predator, tag = tagging location, det = antenna detection location.

In field, tag_status = 'recap' where the method = 'beach seine' would indicate that they survived the freshwater and made it to the estuarine environment.

Regarding the time window to determine if it is an outmigrant, our researchers have made R code with this decision rule in it. The R code is found in documentation/sample_code/detections. Alternatively, an arbitrary time would be at least 100 days from tagging date to detection.

AReyH commented 6 months ago

This query may also need to consider the number of tags deployed in field and in microtroll. PIT tagging fish can occur at hatcheries, or in the wild during field work (at beaches, etc) or microtrolling. Be aware of the tag_status column, which may indicate whether a particular fish was tagged, recaptured a previously tagged fish, etc.

@brahmwg Thanks for the prompt reply! I have a few questions about the query:

What is the difference between the fish tagged in the table hatch_tag where the Stage is microtroll, and the fish present in the table microtroll?
What status should we be aware of in the tag_status column for this specific query? Are you guys interested only in a specific status for this query?

brahmwg commented 6 months ago

What is the difference between the fish tagged in the table hatch_tag where the Stage is microtroll, and the fish present in the table microtroll?

What status should we be aware of in the tag_status column for this specific query? Are you guys interested only in a specific status for this query?

This table has now been updated. The stage is mostly null since they are all tagged at hatcheries in hatch_tag. To identify the stage you should look at viable tags deployed in hatcheries (hatch_tag), in estuaries (field where method = 'beach seine', and in marine environment (microtroll).
For tag_status, should just be aware of 'recap', 'mort', 'reject' which should not be counted towards total tags deployed.

riyaeliza123 commented 6 months ago

WITH table_1 AS(
SELECT tag_id_long, watershed, species, fork_length_mm, date,
  MAX(date) OVER (PARTITION BY tag_id_long) AS last_date
FROM field
)
SELECT *, last_date - date AS no_of_days
FROM table_1
WHERE last_date - date > 100

This is for query 5 (For each species in each system/watershed, list the fork lengths of the fish tagged and identify which of those fish were detected as returning adults (compare size structure of initial and returning cohorts).)

This only returns 3 rows (could be because we have a lot of null values in tag_id_long)

Questions:

I'd like feedback on this, how can I improve the result or is this close to what we're looking for?
I'm wondering if this a calculation error or if the data only provides us this

My progress:

I have another table where I identified the fishes that were detected 2 or more times https://marinescience.info/sqllab/?savedQueryId=42
This gives us the last date a particular fish was seen: https://marinescience.info/sqllab/?savedQueryId=43
This is the table I am working with https://marinescience.info/sqllab/?savedQueryId=44. You can see that there are many null values in the tag column, and only 3 tagged fishes that have no_of_days > 100

brahmwg commented 6 months ago

@riyaeliza123 To improve this query, you'll want to join with the detections table. The idea being to see the date fish are tagged in field and the date they are detected in detections. You'll be able to see the fork length at the tagging date in field as well. You'll also want to look at the location where the tagging event occurred as well as the detection location.

riyaeliza123 commented 6 months ago

WITH table_1 AS (
  SELECT field.tag_id_long, field.watershed, field.species, field.fork_length_mm, field.date, det.last_date
  FROM field 
  INNER JOIN (
    SELECT DISTINCT d.tagid, MAX(DATE(d.datetime)) OVER (PARTITION BY d.tagid) AS last_date
    FROM detections d
  ) det ON field.tag_id_long = det.tagid
)

SELECT *, last_date - date AS no_of_days
FROM table_1
WHERE last_date - date > 100
ORDER BY no_of_days DESC

-- 1397 rows returned

This is my latest query, looks like this is what we are looking for. https://marinescience.info/sqllab/?savedQueryId=45

Upon confirmation, i'll push this to the queries.md file

jlee2843 commented 6 months ago

Currently, SQL query for Task 1 looks like this:

For each species in each system, list the numbers of hatchery and wild tags going out and the timing of those detections (to be able to compare hatchery and wild outmigrations).

SELECT hatch_tag.species,
  hatch_tag.system,
  COUNT(DISTINCT(hatch_tag.tag_id_long)) AS hatch_count,
  COUNT(DISTINCT(field.tag_id_long)) AS field_count
FROM hatch_tag
FULL OUTER JOIN field ON CAST(hatch_tag.date_time_release AS DATE) = field.date
WHERE  CAST(hatch_tag.date_time_release AS TIMESTAMP) >  TIMESTAMP '2020-12-31'
GROUP BY hatch_tag.species, hatch_tag.system, field.date;

During our discussion yesterday, it was mentioned (If I remember correctly!) to use tag_status = 'clip' from the field table. However, if we add WHERE condition based on that, no data is returned. Also, based on this query, only system shown in the result is cowichan.

Link to Query: https://marinescience.info/sqllab/?savedQueryId=46

jlee2843 commented 6 months ago

For each system each year, look up what proportion of the total tags out were detected on seal haul outs or in heron rookeries (to be able to easily pull out predator data and calculate predation rates).

SELECT 
    main.system,
    main.pred_count,
    main.total_entries,
    main.pred_count::INTEGER / main.total_entries::INTEGER AS proportion_over_count
FROM (
    SELECT 
        hatch_tag.system,
        SUM(CASE WHEN predator.predator IS NOT NULL THEN 1 ELSE 0 END) AS pred_count,
        COUNT(*) AS total_entries
    FROM 
        hatch_tag
    LEFT JOIN 
        predator ON hatch_tag.tag_id_long = predator.tag_id_long 
    WHERE 
        predator.predator IN ('heron', 'sea haul') OR predator.predator IS NULL
    GROUP BY 
        hatch_tag.system
) AS main;

This is for Task 6. It's returning all 0's for proportion_over_count, which I think is an issue coming from division. It works fine when I tried changing division to multiplication. We can figure this out together.

brahmwg commented 6 months ago

This is my latest query, looks like this is what we are looking for. https://marinescience.info/sqllab/?savedQueryId=45

Upon confirmation, i'll push this to the queries.md file

Well done @riyaeliza123 ! I believe this is correct. I will run it by Sam & Jamieson to see if there's anything to modify, but it looks good to me.

brahmwg commented 6 months ago

Currently, SQL query for Task 1 looks like this:

For each species in each system, list the numbers of hatchery and wild tags going out and the timing of those detections (to be able to compare hatchery and wild outmigrations).

During our discussion yesterday, it was mentioned (If I remember correctly!) to use tag_status = 'clip' from the field table. However, if we add WHERE condition based on that, no data is returned. Also, based on this query, only system shown in the result is cowichan.

Link to Query: https://marinescience.info/sqllab/?savedQueryId=46

@jlee2843 So you'll actually want to look at clip_status = 'clip' [as opposed to tag_status]to see if the fish was from a hatchery or wild. (fish that are clipped are from a hatchery, while unclipped may be wild or may be hatchery {this would require further exploration with the genetic data where available})

So this query will be a bit more involved; You may want to consider using views or CTEs for this. You can create a view based on a query (using the down arrow and CREATE VIEW next to the RUN button) and refer to it like any other table in other queries. You could accomplish the same with subqueries or CTEs though.

You'll want to look at hatch_tag counts (grouped by species and system) and also join those tag_id_long in hatch_tag with the tagids in detections. The idea is to see tags deployed at hatcheries which are detected later and where they are detected. This could all be a query and the result set used as a view to be included in a later part of this larger query.

You'll want to do a similar process for the field table, but it gets trickier because of the clip_status. So you'll want to look at tags deployed (tag_status = 'tag') and make a count grouped by clip_status, to give some preliminary insight into the number of hatchery (clip_status = 'clip') and [possibly] wild (clip_status = 'unclip') tags deployed in field. Similar to the hatch_tag table, you'll want to join this with the detections table to see which tagids are detected and when/where. This may also be a query that feeds into a view, which can be used in a later query to provide an overall summary.

After those are done, you may want to write another query with summarizes the output of the above queries in one. This is a challenging query so please feel free to ask me for help. I suggest using views as a way to break the problem down into smaller components, but using views is not strictly necessary and you may be able to solve it in other ways.

brahmwg commented 6 months ago

For each system each year, look up what proportion of the total tags out were detected on seal haul outs or in heron rookeries (to be able to easily pull out predator data and calculate predation rates).

@jlee2843 I would start by running a query to determine the total tags out for each system and year. So this will involve looking at the hatch_tag, field, & microtroll tables and getting a count grouped by system & year where the tag_status = 'tag'. This may also be a view to be referred to by later queries.

Once you know the total tags out for each system and year, you can compare that with the tags detected at heron rookeries and seal haul outs. This is made a bit easier by the predator column in the predator table, which indicates if a particular detection was found at a heron rookery or seal haul out. A simple query provides a count of total tag detections at each;

select count(*), predator from predator group by predator The tricky part will be relating these to the appropriate system and year. year could be extracted from the date column in predator. But for system, you could group by location_name in predator, and then determine which system these location_names belong to, but that may need some advice from our biologists since the location_names don't join perfectly with the location table. Let me know when you get to this point and we can help solve that part of the query.

rafecchang commented 6 months ago

Hello @brahmwg, I am working on "For each system/watershed, list all tags deployed that were subsequently redetected on the lowest mainstem array or in the estuary immediately after outmigration (to look at freshwater survival rather than overall survival and adult returns)." This is what I have so far; my thinking process is to have 4 subqueries, consider two tagging tables and two detecting/recapturing tables. I have a question on incorporating the greater than 100 days threshold for outmigration- for those who's first tag being at hatchery, there is no exact date data for me to calculate for 100 days in the HATCH_TAG table. With such cases- how should I design the first two subqueries to fulfill such requirement?

My query is below, the only part missing now is the location data to determine for down stream and method = beach seine as you mentioned above.

WITH 
HATCH_TAG_FIELD AS (
    SELECT DISTINCT HATCH_TAG.tag_id_long
    FROM HATCH_TAG 
    INNER JOIN FIELD ON HATCH_TAG.tag_id_long = FIELD.tag_id_long
    WHERE FIELD.tag_status = 'recap'
),
HATCH_TAG_DETECTIONS AS (
    SELECT DISTINCT HATCH_TAG.tag_id_long
    FROM HATCH_TAG 
    INNER JOIN DETECTIONS ON HATCH_TAG.tag_id_long = DETECTIONS.tagid
    WHERE HATCH_TAG.tag_id_long NOT IN (SELECT tag_id_long FROM HATCH_TAG_FIELD)
),
FIELD_FIELD_DETECTIONS AS (
    SELECT DISTINCT FIELD1.tag_id_long
    FROM FIELD AS FIELD1
    INNER JOIN FIELD AS FIELD2 ON FIELD1.tag_id_long = FIELD2.tag_id_long
    WHERE FIELD1.tag_id_long NOT IN (SELECT tag_id_long FROM HATCH_TAG_FIELD)
    AND FIELD1.tag_id_long NOT IN (SELECT tag_id_long FROM HATCH_TAG_DETECTIONS)
    AND FIELD1.tag_status = 'tag'
    AND FIELD2.tag_status = 'recap'
    AND DATE(FIELD2.date) - DATE(FIELD1.date) > 100
),
FIELD_DETECTIONS AS (
    SELECT DISTINCT FIELD.tag_id_long
    FROM FIELD 
    INNER JOIN DETECTIONS ON FIELD.tag_id_long = DETECTIONS.tagid
    WHERE FIELD.tag_id_long NOT IN (SELECT tag_id_long FROM HATCH_TAG_FIELD)
    AND FIELD.tag_id_long NOT IN (SELECT tag_id_long FROM HATCH_TAG_DETECTIONS)
    AND FIELD.tag_id_long NOT IN (SELECT tag_id_long FROM FIELD_FIELD_DETECTIONS)
    AND FIELD.tag_status = 'tag'
    AND DATE(DETECTIONS.datetime) - DATE(FIELD.date) > 100
)
SELECT * FROM HATCH_TAG_FIELD
UNION
SELECT * FROM HATCH_TAG_DETECTIONS
UNION
SELECT * FROM FIELD_FIELD_DETECTIONS
UNION
SELECT * FROM FIELD_DETECTIONS;

brahmwg commented 6 months ago

Hi @rafecchang , I will take a closer look a bit later and update my comment. At the moment,

I have a question on incorporating the greater than 100 days threshold for outmigration- for those who's first tag being at hatchery, there is no exact date data for me to calculate for 100 days in the HATCH_TAG table. With such cases- how should I design the first two subqueries to fulfill such requirement?

You could use the date_time_release column from the hatch_tag table for this. There are ~7k results in hatch_tag with non-null values for date_time_release. Alternatively, you will need to join with hatch_param based on the system, cohort & outmigration_y to find the tagging_date from hatch_param for the related rows in hatch_tag.

My query is below, the only part missing now is the location data to determine for down stream and method = beach seine as you mentioned above.

As you've pointed out, the subloc column from the location table indicates if a particular antenna array is upstream (us) or downstream (ds).

jlee2843 commented 6 months ago

@brahmwg Hello Brahm, I'm trying to break down Query 1 component by component.

You'll want to look at hatch_tag counts (grouped by species and system) and also join those tag_id_long in hatch_tag with the tagids in detections. The idea is to see tags deployed at hatcheries which are detected later and where they are detected. This could all be a query and the result set used as a view to be included in a later part of this larger query.
SELECT h.species, h.system, f.clip_status, h.tag_id_long, COUNT(h.tag_id_long) AS hatchery_deployed
FROM hatch_tag h
JOIN field f ON h.tag_id_long = f.tag_id_long
WHERE f.tag_status = 'tag' AND
h.clip_status = 'C' AND
f.clip_status = 'unclip'
GROUP BY f.clip_status, h.species, h.system, h.tag_id_long
The query above is used to make a VIEW called hatceryField3. This confuses me a bit since there is no entry in clip_status of hatchery_tag marked as 'clip'. All entries are either 'C' or NULL. Hence I used h.clip_status = 'C' in the query above, but doing this only returns one row.

You'll want to do a similar process for the field table, but it gets trickier because of the clip_status. So you'll want to look at tags deployed (tag_status = 'tag') and make a count grouped by clip_status, to give some preliminary insight into the number of hatchery (clip_status = 'clip') and [possibly] wild (clip_status = 'unclip') tags deployed in field. Similar to the hatch_tag table, you'll want to join this with the detections table to see which tagids are detected and when/where. This may also be a query that feeds into a view, which can be used in a later query to provide an overall summary.
SELECT h.species, h.system, h.tag_id_long, COUNT(h.tag_id_long)
FROM hatch_tag h 
JOIN detections d ON h.tag_id_long = d.tagid
GROUP BY h.species, h.system, h.tag_id_long
This will be used in creating another VIEW called hatcheryDetections.

I don't think I am understanding what the expected output is for the Query 1. In our previous conversation, there was also a mention of looking into dates that are beyond 2021, but is that not required anymore? Also there was a mention of using cohorts as well, but I am unsure how that can be tied to the queries above. It would be a great help to me if you can specify how you would want the final query output to be like.

jlee2843 commented 6 months ago

@brahmwg Here is the revised Query 6, addressing your feedbacks from the comment above!

WITH predator_counts AS (
    SELECT EXTRACT(YEAR FROM predator.date) AS year,
           SUM(CASE WHEN predator.predator = 'heron' THEN 1 ELSE 0 END) AS heron_count,
           SUM(CASE WHEN predator.predator = 'sea houl' THEN 1 ELSE 0 END) AS sea_houl_count,
           location.site_name AS location
    FROM predator
    JOIN location ON predator.location_name = location.site_name
    GROUP BY year, location.site_name
), 
hatch_tag_count AS (
  SELECT h.outmigration_y, h.system, COUNT(h.tag_id_long) AS tag_count
  FROM hatch_tag h
  LEFT JOIN field f ON h.tag_id_long = f.tag_id_long
  LEFT JOIN microtroll m ON h.tag_id_long = m.tag_id_long
  GROUP BY h.outmigration_y, h.system
)

SELECT year, heron_count, sea_houl_count
FROM predator_counts
JOIN hatch_tag_count ON predator_counts.location = hatch_tag_count.system
WHERE year > 2020;

I believe that JOIN location ON predator.location_name = location.site_name this part still needs confirmation. As of now, no output is returned. Saved query can be found here.

rafecchang commented 6 months ago

Hello @brahmwg here is the query I have now:

WITH 
    HATCH_TAG_FIELD AS (
        SELECT DISTINCT HATCH_TAG.tag_id_long, HATCH_TAG.system
        FROM HATCH_TAG 
        INNER JOIN FIELD ON HATCH_TAG.tag_id_long = FIELD.tag_id_long
        LEFT JOIN LOCATION AS LOCATION1 ON LOWER(HATCH_TAG.system) = LOWER(LOCATION1.watershed)
        LEFT JOIN LOCATION AS LOCATION2 ON LOWER(FIELD.site) = LOWER(LOCATION2.site_description)
        WHERE FIELD.tag_status = 'recap'
        AND LOCATION2.location_code IN ('3c', '3d', '3e', '3f', '2d', '41', '7e')
        AND LOCATION1.watershed = LOCATION2.watershed
        AND DATE(FIELD.date) - DATE(HATCH_TAG.date_time_release) > 100
    ),
    HATCH_TAG_DETECTIONS AS (
        SELECT DISTINCT HATCH_TAG.tag_id_long, HATCH_TAG.system
        FROM HATCH_TAG 
        INNER JOIN DETECTIONS ON HATCH_TAG.tag_id_long = DETECTIONS.tagid
        LEFT JOIN LOCATION AS LOCATION1 ON LOWER(HATCH_TAG.system) = LOWER(LOCATION1.watershed)
        LEFT JOIN LOCATION AS LOCATION2 ON LOWER(DETECTIONS.location) = LOWER(LOCATION2.location_code)
        WHERE HATCH_TAG.tag_id_long NOT IN (SELECT tag_id_long FROM HATCH_TAG_FIELD)
        AND LOCATION2.subloc = 'ds'
        AND DATE(DETECTIONS.datetime) - DATE(HATCH_TAG.date_time_release) > 100
    ),
    FIELD_FIELD_DETECTIONS AS (
        SELECT DISTINCT FIELD1.tag_id_long, FIELD1.watershed
        FROM FIELD AS FIELD1
        INNER JOIN FIELD AS FIELD2 ON FIELD1.tag_id_long = FIELD2.tag_id_long
        LEFT JOIN LOCATION AS LOCATION1 ON LOWER(FIELD1.site) = LOWER(LOCATION1.site_description)
        LEFT JOIN LOCATION AS LOCATION2 ON LOWER(FIELD2.site) = LOWER(LOCATION2.site_description)
        WHERE FIELD1.tag_id_long NOT IN (SELECT tag_id_long FROM HATCH_TAG_FIELD)
        AND FIELD1.tag_id_long NOT IN (SELECT tag_id_long FROM HATCH_TAG_DETECTIONS)
        AND FIELD1.tag_status = 'tag'
        AND FIELD2.tag_status = 'recap'
        AND DATE(FIELD2.date) - DATE(FIELD1.date) > 100
        AND (
            (LOCATION1.location_code = '39' AND LOCATION2.location_code IN ('3c', '3d', '3f'))
            OR (LOCATION1.location_code IN ('2c', '2f', '2e') AND LOCATION2.location_code = '2d')
            OR (LOCATION1.location_code IN ('43', '42') AND LOCATION2.location_code = '41')
            OR (LOCATION1.location_code IN ('79', '78', '7f') AND LOCATION2.location_code = '7e')
        )
    ),
    FIELD_DETECTIONS AS (
        SELECT DISTINCT FIELD.tag_id_long, FIELD.watershed
        FROM FIELD 
        INNER JOIN DETECTIONS ON FIELD.tag_id_long = DETECTIONS.tagid
        LEFT JOIN LOCATION ON DETECTIONS.location = LOCATION.location_code
        WHERE FIELD.tag_id_long NOT IN (SELECT tag_id_long FROM HATCH_TAG_FIELD)
        AND FIELD.tag_id_long NOT IN (SELECT tag_id_long FROM HATCH_TAG_DETECTIONS)
        AND FIELD.tag_id_long NOT IN (SELECT tag_id_long FROM FIELD_FIELD_DETECTIONS)
        AND FIELD.tag_status = 'tag'
        AND DATE(DETECTIONS.datetime) - DATE(FIELD.date) > 100
        AND LOCATION.subloc = 'ds'
    )
SELECT * FROM HATCH_TAG_FIELD
UNION
SELECT * FROM HATCH_TAG_DETECTIONS
UNION
SELECT * FROM FIELD_FIELD_DETECTIONS
UNION
SELECT * FROM FIELD_DETECTIONS;

edit: I consider the relative geographical locations by looking at the map for field. For detection, I used 'ds' only

brahmwg commented 6 months ago

@jlee2843 I will try my best to address all of your questions here, and I will come back to update my comment as I review your query;

This confuses me a bit since there is no entry in clip_status of hatchery_tag marked as 'clip'.

You can assume all tags deployed at hatcheries have clip_status = 'clip'. In other tables, when a fish is caught and the researcher notices that it has it's fin clipped, that is an indicator that the fish originated from a hatchery.

The original query (query 1) in english:

For each species in each system, list the numbers of hatchery and wild tags going out and the timing of those detections (to be able to compare hatchery and wild outmigrations).

Starting with the first section "For each species in each system, list the numbers of hatchery and wild tags going out", we would need to use an aggregate function like COUNT(*) and GROUP BY species & system. The query below would produce those summaries for the hatch_tag table;

-- This query produces a summary of counts of tags deployed by system and species for hatcheries
SELECT COUNT(*), system, species 
from hatch_tag 
group by system, species

For the wild tags, we can do a similar summary for the field table. Notably, we would want to only include tag_status = 'tag' (so we exclude recaps, morts, rejects) and to filter by clip_status= 'unclip' to exclude fish that are clipped (and therefore originated from a hatchery).

-- This query produces summary counts of tags deployed by watershed & species for wild tags in the field table
SELECT count(*), watershed, species
from field 
where tag_status = 'tag'
and clip_status= 'unclip'
group by watershed, species

The next part of the original query question "and the timing of those detections (to be able to compare hatchery and wild outmigrations)" will require looking at the individual tags and not the aggregations. For this you'll want to join the tags deployed at hatcheries (hatch_tag) with tags detected by antennas (detections) to understand which tags deployed at hatcheries are being detected later at downstream antennas (subloc = 'ds'). You'll want to do the same procedure but for the wild tags in field; so use a query to filter by tag_status = 'tag' and clip_status='unclipped' to produce a list of tag_id_longs which represent wild tags from the field table, and then join those tagids with detections to again see which tagids are being detected at downstream antennas. You may calculate the difference in days between the tagging event (field.date for the wild tags, hatch_tag.date_time_release for the hatchery tags) and the detection event (detections.datetime).

The final output of this would list tags, indicating the species, system/watershed, if the tag is wild or hatchery origin, the difference in days between tag and detection, etc.

For query 6;

SUM(CASE WHEN predator.predator = 'sea houl' THEN 1 ELSE 0 END) AS sea_houl_count,

This won't produce any results because 'sea houl' is not a valid value for the column predator.predator. You can use SELECT DISTINCT predator FROM ods.predator to see the range of possible values for that column; there are only two, heron or pinniped.

The original text for query 6;

For each system each year, look up what proportion of the total tags out were detected on seal haul outs or in heron rookeries (to be able to easily pull out predator data and calculate predation rates).

So first you'll want to calculate the total tags out for each system and year. This will require some queries like we did above, to get a count of total tags deployed grouped by system and year for the hatch_tag, field, and microtroll tables. For field & microtroll, be sure to filter where tag_status = 'tag'. You could then SUM() the counts from each of those tables to give a total count of tags grouped by system & year. You would then know the total tags out (sum of counts for hatch_tag, field, & microtroll).

Next, you'd want to get a count of tags detected in seal haul outs or heron rookeries. The query below can provide that summary;

-- This query provides a count of tags detected at either seal haul outs or heron rookeries
select count(*), predator, location_name
from predator 
group by predator, location_name

You might then be able to do some general comparison to see out of all tags out in a given year, how many were detected at heron/pinniped locations. You could extract year from predator.date to compare totals on a per year basis. For location comparisons, it will be trickier so you can omit that part for now.

brahmwg commented 6 months ago

@AReyH I responded to your query over slack. Mainly the results for hatchery seem inaccurate due to NULL values for dates. I created a view / query to retrieve the tagging_date from hatch_param and link it with the hatch_tag table. Also, may need to include a stage for 'estuary' which should include tagging events from field where the method='beach seine'.

jlee2843 commented 6 months ago

@brahmwg Hello Brahm, thank you for your feedback. I revised my Query 1 based on your clarifications, and this is what I have now:

WITH hatchTagCount AS (
  SELECT COUNT(*) AS count, system, species, date_time_release::DATE, tag_id_long, 'hatchery' AS origin
  FROM hatch_tag
  GROUP BY system, species, date_time_release, tag_id_long
), fieldCount AS (
  SELECT COUNT(*) AS count, watershed, species, date, tag_id_long, 'field' AS origin
  FROM field 
  WHERE tag_status = 'tag'
  AND clip_status = 'unclip'
  GROUP BY watershed, species, date, tag_id_long
), locationDetection AS (
  SELECT *, detections.datetime::DATE AS date
  FROM detections 
  LEFT JOIN location ON detections.location = location.location_code
  WHERE location.subloc = 'ds'
), hatchTagDetectionList AS (
  SELECT hatch_tag.tag_id_long, hatchTagCount.species, hatchTagCount.system, hatchTagCount.date_time_release, hatchTagCount.origin, locationDetection.date AS detection_date
  FROM hatch_tag 
  LEFT JOIN locationDetection ON hatch_tag.tag_id_long = locationDetection.tag_id
  LEFT JOIN hatchTagCount ON hatch_tag.tag_id_long = hatchTagCount.tag_id_long
  WHERE hatchTagCount.date_time_release IS NOT NULL AND locationDetection.date IS NOT NULL
), fieldDetectionList AS (
  SELECT field.tag_id_long, locationDetection.datetime, field.watershed, fieldCount.species, field.date, fieldCount.origin, locationDetection.date AS detection_date
  FROM field 
  LEFT JOIN locationDetection ON field.tag_id_long = locationDetection.tagid
  LEFT JOIN fieldCount ON field.tag_id_long = fieldCount.tag_id_long
  WHERE field.tag_status = 'tag' AND field.clip_status = 'unclip' AND field.date IS NOT NULL
), fieldComplete AS (
  SELECT fieldDetectionList.tag_id_long, 
    fieldDetectionList.watershed AS "system/watershed",
    fieldDetectionList.species,
    fieldDetectionList.origin,
    ROUND(EXTRACT(EPOCH FROM (fieldDetectionList.datetime - fieldDetectionList.date)) / 86400, 3) AS diff_date
  FROM fieldDetectionList
), hatchComplete AS (
  SELECT hatchTagDetectionList.tag_id_long, 
    hatchTagDetectionList.system AS "system/watershed",
    hatchTagDetectionList.species,
    hatchTagDetectionList.origin,
    ROUND(EXTRACT(EPOCH FROM (hatchTagDetectionList.detection_date - hatchTagDetectionList.date_time_release)) / 86400, 3) AS diff_date
  FROM hatchTagDetectionList
)

https://marinescience.info/sqllab?savedQueryId=52

Can you let me know if I am going in the right direction? I think the fieldComplete query is looking fine, but hatchTagDetectionList does not return any rows because based on what I have, locationDetection.date is all NULL values.

When I figure out what is wrong with the locationDetection.date in hatchTagDetectionList, I plan on concatenating fieldComplete and hatcheryComplete. I would really appreciate it if you can let me know if this is going in the direction that you are expecting me to!

brahmwg commented 6 months ago

Hi @jlee2843 , I replied to your latest question over slack. I think you are on the right track, and I created some views which should hopefully simplify your query.

riyaeliza123 commented 4 months ago

Resolved and complete