Kaggle / learntools

Tools and tests used in Kaggle Learn exercises
Apache License 2.0

SQL Intro Course Exercise 3 not being set up #452

Open semibroiled opened 1 year ago

semibroiled commented 1 year ago

The setup imports reference `hacker_news`, while the actual dataset slug is `hacker-news`; this mismatch may be the cause.
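For context, Kaggle dataset slugs are hyphenated (`hacker-news`) while BigQuery dataset IDs use underscores (`hacker_news`), so the two forms are easy to mix up. A minimal sketch of the conversion (a hypothetical helper, not part of learntools):

```python
def slug_to_bq_dataset(slug: str) -> str:
    """Convert a hyphenated Kaggle dataset slug to the underscored
    form used as a BigQuery dataset ID (hypothetical helper)."""
    return slug.replace("-", "_")

print(slug_to_bq_dataset("hacker-news"))  # hacker_news
```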

semibroiled commented 1 year ago

```
Collecting git+https://github.com/Kaggle/learntools.git
  Cloning https://github.com/Kaggle/learntools.git to /tmp/pip-req-build-l71k0skd
  Running command git clone --filter=blob:none --quiet https://github.com/Kaggle/learntools.git /tmp/pip-req-build-l71k0skd
  fatal: unable to access 'https://github.com/Kaggle/learntools.git/': Could not resolve host: github.com
  error: subprocess-exited-with-error

  × git clone --filter=blob:none --quiet https://github.com/Kaggle/learntools.git /tmp/pip-req-build-l71k0skd did not run successfully.
  │ exit code: 128
  ╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
```

Using Kaggle's public dataset BigQuery integration.

```
NotFound                                  Traceback (most recent call last)
Cell In[4], line 9
      4 binder.bind(globals())
      5 ##import os
      6 #if not os.path.exists('/kaggle/input/hacker_news/comments'):
      7 #    os.symlink('/input/hacker-news/comments','.../input/hacker_news/')
      8 #    pass
----> 9 from learntools.sql.ex3 import *
     10 print("Setup Complete")

File /opt/conda/lib/python3.10/site-packages/learntools/sql/ex3.py:17
     10 prolific_commenters_query = """
     11 SELECT author, COUNT(id) AS NumPosts
     12 FROM `bigquery-public-data.hacker_news.comments`
     13 GROUP BY author
     14 HAVING COUNT(id) > 10000
     15 """
     16 query_job = client.query(prolific_commenters_query)
---> 17 prolific_commenters_answer = query_job.to_dataframe()
     19 # (2) NumDeletedPosts
     20 deleted_posts_query = """
     21 SELECT COUNT(1) AS num_deleted_posts
     22 FROM `bigquery-public-data.hacker_news.comments`
     23 WHERE deleted = True
     24 """

File /opt/conda/lib/python3.10/site-packages/google/cloud/bigquery/job/query.py:1695, in QueryJob.to_dataframe(self, bqstorage_client, dtypes, progress_bar_type, create_bqstorage_client, date_as_object, max_results, geography_as_object)
   1613 def to_dataframe(
   1614     self,
   1615     bqstorage_client: "bigquery_storage.BigQueryReadClient" = None,
   (...)
   1621     geography_as_object: bool = False,
   1622 ) -> "pandas.DataFrame":
   1623     """Return a pandas DataFrame from a QueryJob
   1624
   1625     Args:
   (...)
   1693         :mod:`shapely` library cannot be imported.
   1694     """
-> 1695 query_result = wait_for_query(self, progress_bar_type, max_results=max_results)
   1696 return query_result.to_dataframe(
   1697     bqstorage_client=bqstorage_client,
   1698     dtypes=dtypes,
   (...)
   1702     geography_as_object=geography_as_object,
   1703 )

File /opt/conda/lib/python3.10/site-packages/google/cloud/bigquery/_tqdm_helpers.py:88, in wait_for_query(query_job, progress_bar_type, max_results)
     84 progress_bar = get_progress_bar(
     85     progress_bar_type, "Query is running", default_total, "query"
     86 )
     87 if progress_bar is None:
---> 88     return query_job.result(max_results=max_results)
     90 i = 0
     91 while True:

File /opt/conda/lib/python3.10/site-packages/google/cloud/bigquery/job/query.py:1499, in QueryJob.result(self, page_size, max_results, retry, timeout, start_index, job_retry)
   1496 if retry_do_query is not None and job_retry is not None:
   1497     do_get_result = job_retry(do_get_result)
-> 1499 do_get_result()
   1501 except exceptions.GoogleAPICallError as exc:
   1502     exc.message = _EXCEPTION_FOOTER_TEMPLATE.format(
   1503         message=exc.message, location=self.location, job_id=self.job_id
   1504     )

File /opt/conda/lib/python3.10/site-packages/google/api_core/retry.py:349, in Retry.__call__.<locals>.retry_wrapped_func(*args, **kwargs)
    345 target = functools.partial(func, *args, **kwargs)
    346 sleep_generator = exponential_sleep_generator(
    347     self._initial, self._maximum, multiplier=self._multiplier
    348 )
--> 349 return retry_target(
    350     target,
    351     self._predicate,
    352     sleep_generator,
    353     self._timeout,
    354     on_error=on_error,
    355 )

File /opt/conda/lib/python3.10/site-packages/google/api_core/retry.py:191, in retry_target(target, predicate, sleep_generator, timeout, on_error, **kwargs)
    189 for sleep in sleep_generator:
    190     try:
--> 191         return target()
    193 # pylint: disable=broad-except
    194 # This function explicitly must deal with broad exceptions.
    195 except Exception as exc:

File /opt/conda/lib/python3.10/site-packages/google/cloud/bigquery/job/query.py:1489, in QueryJob.result.<locals>.do_get_result()
   1486 self._retry_do_query = retry_do_query
   1487 self._job_retry = job_retry
-> 1489 super(QueryJob, self).result(retry=retry, timeout=timeout)
   1491 # Since the job could already be "done" (e.g. got a finished job
   1492 # via client.get_job), the superclass call to done() might not
   1493 # set the self._query_results cache.
   1494 self._reload_query_results(retry=retry, timeout=timeout)

File /opt/conda/lib/python3.10/site-packages/google/cloud/bigquery/job/base.py:728, in _AsyncJob.result(self, retry, timeout)
    725 self._begin(retry=retry, timeout=timeout)
    727 kwargs = {} if retry is DEFAULT_RETRY else {"retry": retry}
--> 728 return super(_AsyncJob, self).result(timeout=timeout, **kwargs)

File /opt/conda/lib/python3.10/site-packages/google/api_core/future/polling.py:261, in PollingFuture.result(self, timeout, retry, polling)
    256 self._blocking_poll(timeout=timeout, retry=retry, polling=polling)
    258 if self._exception is not None:
    259     # pylint: disable=raising-bad-type
    260     # Pylint doesn't recognize that this is valid in this case.
--> 261     raise self._exception
    263 return self._result

NotFound: 404 Not found: Table bigquery-public-data:hacker_news.comments was not found in location US

Location: US
Job ID: 9c49c97a-7566-47e9-bdc5-cc58ce59a18a
```

agamemnonc commented 11 months ago

Any updates on this?

semibroiled commented 11 months ago

No, unfortunately my method didn't work as intended, since the whole dataset path has changed. I did find a workaround in a comment on Kaggle itself; I'll post it here for your convenience.

semibroiled commented 11 months ago

Exercise 3 solution steps

1. Add this to the first cell. It patches the installed `ex3.py` to use the renamed table and column, writes the result to `ex3_v2.py`, and imports the patched copy:

```python
from learntools.core import binder

binder.bind(globals())

with open("/opt/conda/lib/python3.10/site-packages/learntools/sql/ex3.py", "r") as f:
    ex3 = f.read()

ex3 = ex3.replace("SELECT author", "SELECT by")
ex3 = ex3.replace("GROUP BY author", "GROUP BY by")
ex3 = ex3.replace("'author'", "'by'")
ex3 = ex3.replace("author", "`by`")
ex3 = ex3.replace(
    "bigquery-public-data.hacker_news.comments",
    "bigquery-public-data.hacker_news.full",
)

with open("/opt/conda/lib/python3.10/site-packages/learntools/sql/ex3_v2.py", "w") as f2:
    f2.write(ex3)

from learntools.sql.ex3_v2 import *
```

2. Change the table name from `comments` to `full`:

```python
table_ref = dataset_ref.table("full")
```

3. This query now works (`by` is a reserved word in BigQuery SQL, so it must be backticked):

```python
prolific_commenters_query = """
SELECT `by`, COUNT(1) AS NumPosts
FROM `bigquery-public-data.hacker_news.full`
GROUP BY `by`
HAVING COUNT(1) > 10000
"""
```
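The replace chain in step 1 can be sanity-checked offline before touching the installed module. A sketch, where the sample string is a stand-in for the relevant part of `ex3.py`, not its real contents:

```python
def patch_ex3_source(source: str) -> str:
    """Apply the workaround's rewrites: rename the dropped `comments`
    table to `full` and the renamed `author` column to `by`."""
    source = source.replace("SELECT author", "SELECT by")
    source = source.replace("GROUP BY author", "GROUP BY by")
    source = source.replace("'author'", "'by'")
    source = source.replace("author", "`by`")
    source = source.replace(
        "bigquery-public-data.hacker_news.comments",
        "bigquery-public-data.hacker_news.full",
    )
    return source

# Stand-in query text, mimicking the shape of the module's SQL.
sample = (
    "SELECT author, COUNT(id) AS NumPosts "
    "FROM bigquery-public-data.hacker_news.comments "
    "GROUP BY author"
)
print(patch_ex3_source(sample))
# SELECT by, COUNT(id) AS NumPosts FROM bigquery-public-data.hacker_news.full GROUP BY by
```

Checking the output confirms the replacements compose in the right order (the specific `SELECT author` / `GROUP BY author` rules fire before the catch-all `author` rule).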

semibroiled commented 11 months ago

The solution comes from user Palak807 on Kaggle and worked for me. It turns out the dataset name itself differs with respect to the `_` and `-` characters.