Thaslim42 opened 5 months ago
Whatever Python package you are using expects to only ever run in a process context where stdout/stderr are attached to a tty device or a file. Python doesn't technically guarantee that, and a file-like object is not obligated to provide a fileno attribute.
Can you paste (as text, not an image) the full stack trace, so it is possible to see which Python package you are using?
If it is your own code, it needs to be able to deal with fileno not being valid and fall back to doing something else to avoid an error.
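For example, robust code might do something like this (a minimal sketch, not taken from any particular package):

import sys

def try_get_fileno(stream):
    # Return the stream's file descriptor, or None if the stream is
    # missing, closed, or not backed by a real descriptor. Under mod_wsgi,
    # sys.stderr is a log object with no underlying descriptor, so
    # fileno() raising here is expected rather than exceptional.
    if stream is None:
        return None
    try:
        return stream.fileno()
    except (AttributeError, OSError, ValueError):
        return None

fd = try_get_fileno(sys.stderr)
if fd is None:
    pass  # fall back, e.g. write via the stream API instead of the descriptor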
Thanks for the fast reply. This project is for web scraping using an OpenAI key, and I am providing sample code so you can see what it does:
import json
import logging
import os

import requests
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

# Configure logging first, before any logging call implicitly configures
# the root logger with defaults.
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)

load_dotenv()
openai_key = os.getenv("OPENAI_API_KEY")
if not openai_key:
    logging.error("OPENAI_API_KEY is not set. Please set it in the .env file.")
    exit(1)

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-3.5-turbo",
    },
    "verbose": False,
    "headless": True,
}

prompt_template = """
Extract the following details for each job listing (do not include duplicates).
Key requirements to scrape data from the website are:
- 'Job title (mandatory)'
- 'skills (mandatory)', which might be listed under other names such as [skill requirement, required skills, required skillset]
- 'years of experience', which might be listed under other names such as [Experience, +(number of years) years of experience, (number of years)+ years of experience].
- If the format is +(number of years) years of experience, then add the value to the 'year_of_range_from' key.
- If only 'Experience' is mentioned, set 'year_of_range_from' to 0 and 'year_of_range_to' to the value of 'Experience'.
Store all the scraped data into the job_listings list.
Here is the structure of each job listing:
{
    "job_title": "job title",
    "company_name": "company name",
    "description": "job description mentioned",
    "skill_set": [
        {"skill": "skill"},
        {"skill": "skill"},
        {"skill": "skill"}
    ],
    "year_of_range_from": experience from,
    "year_of_range_to": experience to, if not detected print 0,
    "location": "location of job",
    "salary_info": "salary range, if not detected print not specified",
    "posted_date": "job posted date, must add '-' between day, month and year, if not detected print 10-06-2024",
    "application_url": "job application URL"
}
Ensure the output is a JSON array where each element is a JSON object containing the details of a job listing.
"""


def load_previously_scraped_data(file_path):
    if os.path.exists(file_path):
        if os.path.getsize(file_path) > 0:
            with open(file_path, "r") as file:
                return json.load(file)
    return []


def save_scraped_data(file_path, data):
    with open(file_path, "w") as file:
        json.dump(data, file, indent=4)


scraped_data_file = "previously_scraped_jobs.json"
previously_scraped_data = load_previously_scraped_data(scraped_data_file)
previously_scraped_urls = {job["application_url"] for job in previously_scraped_data}
all_job_listings = []


def scrape_website(website):
    for page in range(1):  # Adjust the range for more pages
        params = website["params"].copy()
        params["startPage"] = page
        url = (
            website["base_url"] + "?" + "&".join(f"{k}={v}" for k, v in params.items())
        )
        logging.info(f"Scraping URL: {url}")
        smart_scraper_graph = SmartScraperGraph(
            prompt=prompt_template, source=url, config=graph_config
        )
        result = smart_scraper_graph.run()
        logging.info(f"Scrape result: {result}")
        # Use .get() so a result without the key doesn't raise KeyError.
        if result and result.get("job_listings") is not None:
            for job_listing in result["job_listings"]:
                if "application_url" not in job_listing:
                    logging.warning(
                        f"Skipping job listing without application_url: {job_listing}"
                    )
                    continue
                if job_listing["application_url"] not in previously_scraped_urls:
                    all_job_listings.append(job_listing)
                    previously_scraped_urls.add(job_listing["application_url"])
        else:
            logging.info(f"No job listings found for {website['name']} on page {page}")


def transformData(all_job_listings):
    transformed_data = []
    for job_listing in all_job_listings:
        logging.info(f"Transforming job listing: {job_listing}")
        transformed_job = {
            "job_title": job_listing["job_title"],
            "company_name": job_listing["company_name"],
            "description": job_listing["description"],
            "skill_set": job_listing["skill_set"],
            "year_of_range_from": job_listing["year_of_range_from"],
            "year_of_range_to": job_listing["year_of_range_to"],
            "location": job_listing["location"],
            "salary_info": job_listing["salary_info"],
            "posted_date": job_listing["posted_date"],
            "application_url": job_listing["application_url"],
        }
        transformed_data.append(transformed_job)
    newly_scraped_data = previously_scraped_data + transformed_data
    save_scraped_data(scraped_data_file, newly_scraped_data)
    # Save the transformed new data to a separate JSON file
    with open("API_jobs.json", "w") as file:
        json.dump(transformed_data, file, indent=4)
    logging.info("Transformed data saved to API_jobs.json")
    logging.info("Data being sent to the API")
And it is also showing this: the error "OSError: Apache/mod_wsgi log object is not associated with a file descriptor" typically occurs when a Python script or application is trying to use a logger that is not compatible with the mod_wsgi environment.
I still need to see the full Python stack trace from the Apache error log to understand what the calling sequence is.
root@vps:/var/www/recruitment-app/RECRUITMENT
... minute='51'], next run at: 2024-06-27 10:51:00 IST)" (scheduled at 2024-06-27 10:51:00+05:30)
[Thu Jun 27 10:51:00.003737 2024] [wsgi:error] [pid 227404:tid 140555544450624] scrapping started......
[Thu Jun 27 10:51:00.004018 2024] [wsgi:error] [pid 227404:tid 140555544450624] 2024-06-27 10:51:00,003 - INFO - Scraping Infopark Kochi phase 1...
[Thu Jun 27 10:51:00.394899 2024] [wsgi:error] [pid 227404:tid 140555544450624] --- Executing Fetch Node ---
[Thu Jun 27 10:51:00.395802 2024] [wsgi:error] [pid 227404:tid 140555544450624] (Fetching HTML from: https://infopark.in/companies/jobs/kochi-phase-1?startPage=1)
[Thu Jun 27 10:51:00.586503 2024] [wsgi:error] [pid 227404:tid 140555544450624] 2024-06-27 10:51:00,586 - INFO - Starting scraping...
[Thu Jun 27 10:51:00.589935 2024] [wsgi:error] [pid 227404:tid 140555544450624] 2024-06-27 10:51:00,587 - ERROR - Task exception was never retrieved
[Thu Jun 27 10:51:00.589995 2024] [wsgi:error] [pid 227404:tid 140555544450624] future: <Task finished name='Task-2' coro=<Connection.run() done, defined at /usr/local/lib/python3.10/dist-packages/playwright/_impl/_connection.py:265> exception=OSError('Apache/mod_wsgi log object is not associated with a file descriptor.')>
[Thu Jun 27 10:51:00.590012 2024] [wsgi:error] [pid 227404:tid 140555544450624] Traceback (most recent call last):
[Thu Jun 27 10:51:00.590021 2024] [wsgi:error] [pid 227404:tid 140555544450624]   File "/usr/local/lib/python3.10/dist-packages/playwright/_impl/_connection.py", line 272, in run
[Thu Jun 27 10:51:00.590029 2024] [wsgi:error] [pid 227404:tid 140555544450624]     await self._transport.connect()
[Thu Jun 27 10:51:00.590035 2024] [wsgi:error] [pid 227404:tid 140555544450624]   File "/usr/local/lib/python3.10/dist-packages/playwright/_impl/_transport.py", line 133, in connect
[Thu Jun 27 10:51:00.590042 2024] [wsgi:error] [pid 227404:tid 140555544450624]     raise exc
[Thu Jun 27 10:51:00.590049 2024] [wsgi:error] [pid 227404:tid 140555544450624]   File "/usr/local/lib/python3.10/dist-packages/playwright/_impl/_transport.py", line 126, in connect
[Thu Jun 27 10:51:00.590056 2024] [wsgi:error] [pid 227404:tid 140555544450624]     stderr=_get_stderr_fileno(),
[Thu Jun 27 10:51:00.590063 2024] [wsgi:error] [pid 227404:tid 140555544450624]   File "/usr/local/lib/python3.10/dist-packages/playwright/_impl/_transport.py", line 38, in _get_stderr_fileno
[Thu Jun 27 10:51:00.590111 2024] [wsgi:error] [pid 227404:tid 140555544450624]     return sys.stderr.fileno()
[Thu Jun 27 10:51:00.590124 2024] [wsgi:error] [pid 227404:tid 140555544450624] OSError: Apache/mod_wsgi log object is not associated with a file descriptor.
[Thu Jun 27 10:51:00.593679 2024] [wsgi:error] [pid 227404:tid 140555544450624] 2024-06-27 10:51:00,593 - ERROR - An error occurred during scraping: Apache/mod_wsgi log object is not associated with a file descriptor.
[Thu Jun 27 10:51:00.594078 2024] [wsgi:error] [pid 227404:tid 140555544450624] 2024-06-27 10:51:00,593 - INFO - Scraping Infopark cherthala...

The same "Task exception was never retrieved" traceback, ending in the same OSError, then repeats for Infopark Cherthala (https://infopark.in/companies/jobs/cherthala?startPage=1, Task-6) and Infopark Thrissur (https://infopark.in/companies/jobs/thrissur?startPage=1, Task-10).
The problem is here:
That package copied code from faulthandler, which back in the day was known to not handle the case where stdout/stderr were not associated with a file descriptor very well. They then used that same bad practice themselves to get a file descriptor to use for stderr of a sub process.
This is a bad way of doing things. They shouldn't override stderr at all and should just let the sub process inherit it from the parent.
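To illustrate (a minimal sketch, not Playwright's actual code): with Python's subprocess machinery, passing stderr=None lets the child inherit the parent's stderr without ever asking for a file descriptor, whereas resolving fileno() up front is exactly what fails under mod_wsgi.

import subprocess
import sys

# Fragile: under mod_wsgi sys.stderr is a log object with no underlying
# descriptor, so this raises OSError before the child is even started.
# subprocess.Popen(["true"], stderr=sys.stderr.fileno())

# Robust: stderr=None (the default) simply lets the child inherit the
# parent process's stderr, whatever that happens to be.
subprocess.Popen(["true"], stderr=None)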
I have no idea whether it will work or not, but you might be able to set:
WSGIRestrictStdout Off
as a directive in the Apache configuration, but I can't remember what the implications for Apache's logging will be if that is done.
Do both WSGIRestrictStdout Off and WSGIRestrictStdin Off need to be set? Or can I add a custom logger like this?
import logging
# Create a custom logger that writes to a file
logger = logging.getLogger('my_logger')
logger.setLevel(logging.INFO)
handler = logging.FileHandler('my_log_file.log')
formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
logger.addHandler(handler)
# Use the custom logger instead of the default logger
def start_scrape():
    # ...
    try:
        for website in websites:
            try:
                logger.info(f"Scraping {website['name']}...")
                scrape_website(website)
            except TypeError as te:
                logger.error(f"A TypeError occurred during scraping: {te}")
            except Exception as e:
                logger.error(f"An error occurred during scraping: {e}")
    except Exception as e:
        logger.error(f"An error occurred during scraping: {e}")
I have no idea what playwright is for, but I am presuming that where it is failing has nothing to do with logging as such, so a custom logger wouldn't change anything.
There wouldn't be any point in changing WSGIRestrictStdin, as the code only seemed to be doing stuff with stderr and not stdin. The WSGIRestrictStdout directive also affects stderr, from memory.
Of note, faulthandler in the Python standard library is now implemented in C code and is tolerant of any exception when accessing the fileno attribute. So they copied the bad way of doing things from the time when it was implemented in Python.
The C version is closer to doing:
import sys
from typing import Optional

def _get_stderr_fileno() -> Optional[int]:
    try:
        # when using pythonw, sys.stderr is None.
        # when Pyinstaller is used, there is no closed attribute because
        # Pyinstaller monkey-patches it with a NullWriter class
        if sys.stderr is None or not hasattr(sys.stderr, "closed"):
            return None
        if sys.stderr.closed:
            return None
        return sys.stderr.fileno()
    except Exception:
        # pytest-xdist monkeypatches sys.stderr with an object that is not an actual file.
        # https://docs.python.org/3/library/faulthandler.html#issue-with-file-descriptors
        # This is potentially dangerous, but the best we can do.
        if not hasattr(sys, "__stderr__") or not sys.__stderr__:
            return None
        return sys.__stderr__.fileno()
In other words, on any exception, ignore it and try the alternate lookup.
I realise you can't readily fix the playwright code itself, except perhaps by using wrapt to monkey patch it, replacing _get_stderr_fileno with a better variant of the function.
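A rough sketch of such a patch (untested, and using plain attribute assignment rather than wrapt for brevity; it reaches into the private playwright._impl._transport module, which the traceback above shows is where _get_stderr_fileno lives, so it may break across Playwright versions):

import sys
from typing import Optional

import playwright._impl._transport as _transport

def _tolerant_get_stderr_fileno() -> Optional[int]:
    # Mirror faulthandler's C behaviour: swallow any failure and fall
    # back to the real process stderr, or give up quietly with None.
    try:
        return sys.stderr.fileno()
    except Exception:
        try:
            return sys.__stderr__.fileno() if sys.__stderr__ else None
        except Exception:
            return None

# Must run before the first Playwright connection is created.
_transport._get_stderr_fileno = _tolerant_get_stderr_fileno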
Okay, looking at Playwright, it is forking browser processes. Doing that inside a web server such as Apache is a really bad idea.
You should look at changing the architecture of things and use a task queueing system like Celery to provide a distinct service, which the web service part can make requests of to do your scraping and then wait for the results.
In other words, creating major sub processes from web applications is generally not recommended, since the sub processes would inherit a lot of strange state from the web server processes, e.g. open socket connections for incoming requests, and so could interfere with the operation of the web server. Thus one usually farms such work out to a separate independent service.
So which web server would you recommend?
It is not a case of which web server. The nature of any front end web server means it is not necessarily a suitable host for significant forked sub process execution.
The problem is that web servers are usually handling multiple concurrent socket connections from remote HTTP clients. Web servers can have strange setups for log files, including piped loggers or in-process log file rotation mechanisms. And finally, web servers can have multiple dynamic worker processes which are spun up and destroyed as necessary to handle requests.
The consequence of fork/exec'ing a non-trivial application is that those sub processes would normally inherit all the open file descriptors/sockets of the parent web server process. In the worst case, a non-trivial forked application could interfere with the operation of the web server by interacting with those inherited open connections. You could also have issues where your application assumes only one instance of a forked sub process will run at a time, since the web server may have multiple worker processes and thus you could end up with more than one. Finally, if you expect your forked sub process to keep running indefinitely, you may have issues, as it could get killed off when the web server decides to shut down a worker process.
For that reason, you are better off not directly forking complicated sub processes from your main public web server application. Instead, create a separate, more constrained service application whose job is to run specific tasks, independent of whether or not they are being done as part of a web request. Then have the front end web server make requests of that service application to do the work.
One way of implementing such a service application for running the jobs is to use a task queue such as Celery (https://docs.celeryq.dev/en/stable/).
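A minimal sketch of that arrangement, assuming a Redis broker on localhost and reusing the scrape_website function from the script above (the module names tasks and scraper are made up):

# tasks.py -- executed by a standalone Celery worker, not under Apache.
from celery import Celery

from scraper import scrape_website, all_job_listings  # hypothetical module

app = Celery("scraper",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def scrape_site(website):
    # This worker process owns real stdout/stderr descriptors, so starting
    # Playwright browser sub processes here is safe.
    scrape_website(website)
    return all_job_listings

# In the WSGI application, enqueue the job instead of scraping in-process:
#   result = scrape_site.delay(website)
#   job_listings = result.get(timeout=300)

The worker is then run separately (e.g. celery -A tasks worker), so the browser processes are forked from a process the web server knows nothing about.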
If you really wanted to still implement your task server as a web service, then use a very lightweight, single process web server instead. It should still sit behind the front end web server though, as these lightweight servers aren't necessarily as secure and robust as your main web server. For this you might use aiohttp.
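For example, a bare-bones internal scraping endpoint with aiohttp might look like this (a sketch only: the /scrape route, port and scraper module import are made up):

# internal_scraper_service.py -- lightweight internal-only web service.
# Run as its own process, behind the front end web server, never exposed
# directly to the public internet.
import asyncio

from aiohttp import web

from scraper import scrape_website, all_job_listings  # hypothetical module

async def handle_scrape(request):
    website = await request.json()
    # Run the blocking scrape off the event loop; this process owns real
    # stdout/stderr descriptors, so Playwright's sub process startup works.
    await asyncio.to_thread(scrape_website, website)
    return web.json_response({"status": "ok", "jobs": all_job_listings})

app = web.Application()
app.add_routes([web.post("/scrape", handle_scrape)])

if __name__ == "__main__":
    web.run_app(app, host="127.0.0.1", port=8081)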
So it is an architectural design issue. If you don't care, and WSGIRestrictStdout Off solves the issue, then do that, but I would not use it in anything that is expected to be a robust production deployment.
Thank you for your patience and response. You are doing a great job.