apolbernardo / paper_1

my first paper as data analyst
titanic paper review #12

Open apolbernardo opened 11 months ago

apolbernardo commented 11 months ago

Clarification of the step-by-step process. First, test item reliability by doing a split-half and comparing the reliability of the A and B items (the teaching-the-model part). For example, if the respondent is female, there is a high chance of support from internal sources, provided this stays consistent with the previous step. Second, after teaching the model, verify the model using the ROC method, which tests whether the output lies on the positive or the negative side.

Just a question: if my data shows good quality based on a reliability test, should I bother testing my model, since I know that if the data is good, it will be reflected in the model?

docligot commented 10 months ago

The tests for the data (e.g. chi-square test, Information Value) are separate from the tests of the model (e.g. K-S, ROC, CAP, AUC, Confusion Matrix). This is similar to checking p-value of variables vs. checking the p-value of the model via an F-test in linear regression.

It should follow that if the variables are good, the model will probably be good too. But there are decisions at the model level that affect accuracy, like where to set the cutoff score.
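To illustrate the cutoff point, here is a minimal sketch in plain Python. The scores and outcomes are hypothetical (not from the Titanic data or the scorecard), and `confusion_at_cutoff` and `accuracy` are illustrative helper names. The same scores give different confusion matrices and accuracy depending on where the cutoff is placed.

```python
from typing import NamedTuple


class Confusion(NamedTuple):
    tp: int  # predicted 1, actual 1
    fp: int  # predicted 1, actual 0
    tn: int  # predicted 0, actual 0
    fn: int  # predicted 0, actual 1


def confusion_at_cutoff(scores, labels, cutoff):
    """Count outcomes when every score >= cutoff is predicted as 1."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= cutoff else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        else:
            fn += 1
    return Confusion(tp, fp, tn, fn)


def accuracy(c):
    """Share of cases classified correctly."""
    return (c.tp + c.tn) / (c.tp + c.fp + c.tn + c.fn)


# Hypothetical model scores and true outcomes (1 = survived, 0 = did not).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

for cutoff in (0.35, 0.5, 0.85):
    c = confusion_at_cutoff(scores, labels, cutoff)
    print(cutoff, c, accuracy(c))
```

The variables and the scores are identical in all three runs; only the cutoff changes, which is why the model-level tests are worth running even when the data-level tests look good.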

The steps we discussed yesterday were:

apolbernardo commented 10 months ago

scorecard

Am I on the right track? My next step is the ratio between gender and age group for those who receive internal and external support.

docligot commented 10 months ago

So here, your target variable that you want to score is Employer Discussion vs. Coworker Discussion?

What does the variable signify?

You have to decide which variable you will ultimately want to create a score for. Like in Titanic, the score was to predict survival which is signified by 0 (dead) vs. 1 (survived)

apolbernardo commented 10 months ago

Yes. I converted the categorical features to numbers: for coworker and employer, 0 means no and 1 means yes. I then counted them in my second pivot, where 0 is male and 1 is female, and for working in tech, 0 means no and 1 means yes.

docligot commented 10 months ago

Ok, but which is your target variable?

apolbernardo commented 10 months ago

The target variables are gender, age, support from employers, support from coworkers, and whether you're working at a tech company.

apolbernardo commented 10 months ago

Although I'm confused about whether to include mental health packages as internal support. Then there would be 2 internal supports (from the employer and the HMO) against 1 external support, which is support from coworkers.

docligot commented 10 months ago

By target, I mean the phenomenon you want to score. Like in Titanic, it's survival.

The other variables will attempt to explain the target variable

docligot commented 10 months ago

Although I'm confused about whether to include mental health packages as internal support. Then there would be 2 internal supports (from the employer and the HMO) against 1 external support, which is support from coworkers.

You can do more than 1 scorecard if you like

apolbernardo commented 10 months ago

I just want to score who is most likely to receive mental health support in the workplace, at a tech company vs. a non-tech company.

docligot commented 10 months ago

So mental health support is the target variable.

Tech vs. non tech will be one of the scoring variables that will explain the mental health support (along with the other fields)

apolbernardo commented 10 months ago

I was thinking: what if I add MH_share as a clear reflection? For example, if I receive internal support, then this is my MH status based on MH_share, meaning a score for how satisfied they are with their mental health, like 0-4 is good and 5-10 is best.

docligot commented 10 months ago

You can match the values for MH Share using Weight of Evidence and it becomes a scoring variable to explain the outcome
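A minimal sketch of how Weight of Evidence (WoE) and Information Value (IV) could be computed for a binned variable such as MH Share. The counts are hypothetical, `woe_iv` is an illustrative helper name, and the convention WoE = ln(% distribution of positives / % distribution of negatives) is an assumption; some references use the inverse ratio, which only flips the sign.

```python
import math


def woe_iv(counts):
    """Compute WoE per category and the total IV.

    counts maps each category to (n_positive, n_negative).
    Assumes every count is greater than zero, because the logarithm
    is undefined for a zero distribution.
    """
    total_pos = sum(pos for pos, _ in counts.values())
    total_neg = sum(neg for _, neg in counts.values())
    woe = {}
    iv = 0.0
    for category, (pos, neg) in counts.items():
        dist_pos = pos / total_pos  # % distribution of positives
        dist_neg = neg / total_neg  # % distribution of negatives
        woe[category] = math.log(dist_pos / dist_neg)
        iv += (dist_pos - dist_neg) * woe[category]
    return woe, iv


# Hypothetical counts for two MH Share bins: (received support, did not).
mh_share_counts = {"0-4": (30, 60), "5-10": (70, 40)}

woe, iv = woe_iv(mh_share_counts)
for category, value in woe.items():
    print(category, round(value, 4))
print("IV", round(iv, 4))
```

Each raw MH Share value is then replaced by the WoE of its bin, and that WoE column is what enters the scorecard as a scoring variable.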

apolbernardo commented 10 months ago

I am computing the chi-square and just want a clarification. The first step is to compute the total class counts and then divide by the total population. The next step is total population minus (cc/population), and then the value of 0/(total - (cc/population)^2- (total - (cc/population)

apolbernardo commented 10 months ago

scorecard.xlsx

I haven't checked the chi-square in Python yet. I got a bit confused about the chi-square and the p-value. Thank you.

docligot commented 10 months ago

scorecard.xlsx

Added a sheet to your file to explain the calculations.
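For cross-checking the sheet in Python, here is a minimal sketch of the Pearson chi-square calculation using only the standard library. The counts are hypothetical, `chi_square` and `p_value_df1` are illustrative helper names, and the p-value shortcut holds only for 1 degree of freedom (a 2x2 table). No continuity correction is applied, so the result can differ from `scipy.stats.chi2_contingency`, which applies Yates' correction to 2x2 tables by default.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical counts: rows = tech company (no, yes), columns = MH support (no, yes).
table = [[40, 60], [30, 70]]

stat, df = chi_square(table)
p = p_value_df1(stat)
print(round(stat, 4), df, round(p, 4))
```

A small p-value (commonly below 0.05) suggests the variable is associated with the target; a large one suggests the variable adds little.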

apolbernardo commented 10 months ago

May I ask about gender? Since I encoded Other as 2, how do I compute the % distribution of positives and the % distribution of negatives for it?

apolbernardo commented 10 months ago

It shows that whether or not you work at a tech company does not contribute to having internal or external MH support. My only problem is gender, and after figuring that out, I'm going to counter-check the chi-square in Python.

docligot commented 10 months ago

May I ask about gender? Since I encoded Other as 2, how do I compute the % distribution of positives and the % distribution of negatives for it?

You do it the same way. Btw what is the 0 or 1 in the counts? I saw it as #tech? This means you are really scoring the likelihood to be in a tech company?

apolbernardo commented 9 months ago

Yes, 0 means no and 1 means yes. But I found that the weight of evidence is 0, so I might disregard this variable, since it seems that working at a tech company has no direct effect on the support employees expect to receive.

apolbernardo commented 9 months ago

I am done with gender, but the chi-square output in the sheet is #NUM. I will run it in Python to verify the chi-square, and after that I will proceed with the weight of evidence process. I have low hopes for my gender and tech industry variables. They might not be useful for determining support, since their weight of evidence is 0. But my exploratory data analysis shows that they vary, so how come their weight of evidence is 0?
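One possible explanation, sketched under assumptions: WoE compares the distribution of positives with the distribution of negatives, not the raw counts. A variable can vary a lot in exploratory counts and still have a WoE of 0 in every category if the positive-to-negative ratio is the same everywhere. The counts below are hypothetical, not taken from the scoring sheet, and `woe_by_category` is an illustrative helper name.

```python
import math


def woe_by_category(counts):
    """WoE per category, where counts maps category -> (n_positive, n_negative)."""
    total_pos = sum(pos for pos, _ in counts.values())
    total_neg = sum(neg for _, neg in counts.values())
    return {
        category: math.log((pos / total_pos) / (neg / total_neg))
        for category, (pos, neg) in counts.items()
    }


# Hypothetical: group sizes differ a lot (60 vs. 240 people),
# but one in three is positive in both groups.
tech = {"no": (20, 40), "yes": (80, 160)}

# Hypothetical: the share of positives differs between groups.
gender = {"male": (50, 150), "female": (50, 50)}

tech_woe = woe_by_category(tech)
gender_woe = woe_by_category(gender)
print(tech_woe)
print(gender_woe)
```

In the first table the WoE is 0 for both categories even though the counts differ; in the second it is negative for one group and positive for the other. A zero count in any cell would make the logarithm undefined, which is worth checking separately in the sheet.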

docligot commented 9 months ago

Can you do a pull request to upload the whole scoring sheet? Let's check it out.

apolbernardo commented 9 months ago

Doc, have you seen the pull request? Let me know, because I might have made a mistake with the pull request again. Thank you.

docligot commented 9 months ago

Saw it. Just merged. Will check the calculation

docligot commented 9 months ago

any updates?

apolbernardo commented 9 months ago

I'm now working on the IF statement, since the 2 variables, gender and working at tech, have 0 weight of evidence. However, the others return FALSE, and I don't know whether I should add a less-than to the arguments.

apolbernardo commented 9 months ago

The scoring will be done on the train set, since that is the only one that will be used in the logistic regression.

apolbernardo commented 9 months ago

scorecard_superfinal.xlsx

Everything is going well with % positive (bar) and weight of evidence (line), except for tech company and MH coverage.

apolbernardo commented 9 months ago

Hello doc, an update on my work: I am working on the pivot scoring in scorecard_superfinal.xlsx, on the train set only. If my interpretation is right, the higher the score, the more likely an individual suffers from a mental health condition in pivot_mh, and a high score means fewer funds are likely to be allotted to MH under pivot_mhshare.

docligot commented 8 months ago

scorecard_superfinal.xlsx

apolbernardo commented 8 months ago

I tried asking ChatGPT 3.5 about it and its brain cells couldn't handle it. Hopefully Bard can help, since it can handle image prompts.

docligot commented 8 months ago

What were you asking ChatGPT?

apolbernardo commented 8 months ago

I was asking what the possible reasons are that my scores are all the same even though my weight of evidence and IV values differ, and yes, it crashed. Bard told me there is a possibility of outliers. When you look at it, the gap between male and female is small, but the gap with Other is large. So I was thinking, doc, that maybe Other is an outlier, and that it affects the female and male values.

apolbernardo commented 8 months ago

Variable Weighting:

Are all variables contributing to the gender score weighted equally? It is possible that the variable with the highest weight of evidence is dominating the score, causing other variables with lower weights to have minimal impact.

Suggestion: Review the weightings assigned to each variable and adjust them if necessary to ensure a more balanced evaluation of the gender score.

Data Aggregation:

Is the gender score calculated based on individual data points or aggregated data? If you are using aggregated data, it could mask the true differences between individual data points, resulting in the same score for everyone.

Suggestion: Disaggregate the data and analyze the scores for each individual to identify any underlying patterns or discrepancies.

Scoring Algorithm:

How is the gender score calculated? Is there a specific algorithm or formula being used? A flaw or error in the algorithm could lead to incorrect or consistent scores.

Suggestion: Double-check the scoring algorithm for any potential errors or bias. Consider consulting with the developer or reviewing the documentation for clarification.

These are Bard's other suggestions, which I don't understand.

docligot commented 8 months ago

Did you try Poe.com?

Try uploading a dataset and see if it can analyze it. We can sort it out in our next chat.

apolbernardo commented 8 months ago

Doc, it seems that Poe cannot process Excel files or images; it's just a text-based AI :( Is Gemini free? Hahaha. But I have my doubts, because Gemini can't count either.

apolbernardo commented 8 months ago

I was reading the paper that uses this dataset. They use the dataset to predict the risk of having an MH issue, and this statement is from the research: "Being the minority, women suffer more from the negative consequences of the gender differences and are seen to have higher rates of mental health concerns. The fractions of female employees in each of the three risk clusters were studied. We observed that a majority of the female employees (62.9%) belonged to the high-risk cluster, 33.1% of them belonged to the medium-risk cluster and only 4% of them were in the low-risk cluster. This is consistent with the findings in the literature on higher rate of 12 months and lifetime diagnosis of any mental health condition in women over men."

I was trying to see whether the research is gender biased, which could be the reason for the large gap in weight of evidence. As you can see, Other has the largest WoE, so I was thinking it might be an outlier, which is why female and male seem to get pulled along.

apolbernardo commented 8 months ago

s42979-022-01613-z (2).pdf This is the paper, and I'm reading the process of creating predictor models with hardcore maths, hahaha.

docligot commented 8 months ago

Finding it fishy that all the accuracy rates are high. Our scorecard should have the same power in theory.

docligot commented 8 months ago

Doc, it seems that Poe cannot process Excel files or images; it's just a text-based AI :( Is Gemini free? Hahaha. But I have my doubts, because Gemini can't count either.

You can attach documents on some of the chatbots. Not as powerful as GPT-4 yet. Gemini is already available on Poe.

apolbernardo commented 8 months ago

I was thinking about the population: what if female, male, and Other were all equal, say 20 each? From what I've read, female is more likely to have high scores because of its low population. If that's the case, male is lower because of its population, while Other has the lowest population, which is why it has the highest score and becomes an outlier. I don't know if we could consider the question itself to be gender biased.

apolbernardo commented 8 months ago

I don't want to remove Other because I think it's a new element in the research.

apolbernardo commented 7 months ago

Hello doc, I was reading the paper and it states that they removed 50% with missing values and replaced values with the average, so the total number of processed participants is 1400 out of 1836. I need to remove 436 participants due to blank responses. I also removed questions that are too subjective. That's the update for now.

apolbernardo commented 7 months ago

On cleaning: as mentioned, they removed 50% with values, but the question is what they based this on. As you can see in my snip, the MH condition is null, but some of those respondents have a past and current MH condition. If I remove those null values under MH Condition, there will be data integrity issues, so I see no chance of removing any data.

docligot commented 7 months ago

You need to decide on some rules, like if past and current have a value, then MH condition should be yes anyway. We document these rules as part of the paper moving forward.
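A minimal sketch of how such a rule could be written down and applied so that it is easy to document. The field names (`mh_condition`, `past_mh`, `current_mh`, `rule_applied`) are hypothetical, and the sketch assumes that "has a value" means the respondent answered "Yes" to both the past and the current question; adjust this to whatever rule is finally agreed.

```python
def apply_mh_rule(record):
    """Fill a blank MH condition from the past and current answers.

    Assumed rule: if mh_condition is missing and both past_mh and
    current_mh are "Yes", set mh_condition to "Yes". The original
    record is left untouched; a filled copy is returned.
    """
    filled = dict(record)
    filled["rule_applied"] = None
    is_blank = filled.get("mh_condition") in (None, "")
    if is_blank and filled.get("past_mh") == "Yes" and filled.get("current_mh") == "Yes":
        filled["mh_condition"] = "Yes"
        # Keep a note of the rule so it can be reported in the paper.
        filled["rule_applied"] = "past_and_current_yes"
    return filled


# Hypothetical survey rows.
rows = [
    {"mh_condition": None, "past_mh": "Yes", "current_mh": "Yes"},
    {"mh_condition": None, "past_mh": "No", "current_mh": "Yes"},
    {"mh_condition": "No", "past_mh": "Yes", "current_mh": "Yes"},
]

cleaned = [apply_mh_rule(row) for row in rows]
for row in cleaned:
    print(row)
```

Only the first row is changed. Rows that are still blank after all rules have been applied are the candidates for removal, and the `rule_applied` column gives a count of how many values each rule filled.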

docligot commented 7 months ago

Hello doc, I was reading the paper and it states that they removed 50% with missing values and replaced values with the average, so the total number of processed participants is 1400 out of 1836. I need to remove 436 participants due to blank responses. I also removed questions that are too subjective. That's the update for now.

I think it is good to limit the data to information you can verify.

apolbernardo commented 7 months ago

Update, doc: I have removed 52% of my data. Out of 1386 records, I removed 957 based on MH responses such as "Possibly" and "I don't know". I think this gives the best data quality. The next step is correcting spellings, and since MH is now binary (just Yes and No), it is easier. But I still kept clean version 1 for reference.
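The filtering step could be sketched in Python as follows, assuming the MH answers are plain text. The sample responses are hypothetical and `to_binary_target` is an illustrative helper name. Only clear "Yes" and "No" answers are kept and mapped to 1 and 0; everything else is counted as dropped so the removal rate can be reported.

```python
def to_binary_target(responses):
    """Keep only Yes/No answers, mapped to 1/0, and count the rest as dropped."""
    mapping = {"yes": 1, "no": 0}
    kept = []
    dropped = 0
    for response in responses:
        # Normalize spacing and case so "Yes " and "yes" are treated alike.
        key = response.strip().lower() if response else ""
        if key in mapping:
            kept.append(mapping[key])
        else:
            dropped += 1
    return kept, dropped


# Hypothetical MH responses.
responses = ["Yes", "No", "Possibly", "I don't know", " yes ", None, "NO"]

target, n_dropped = to_binary_target(responses)
print(target, n_dropped)
```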

apolbernardo commented 7 months ago

fullversion_dataset.xlsx This is the update on my dataset so far. Hopefully this cleaner version has the best quality data that captures the research aim, although in the paper they removed only 50%.

apolbernardo commented 6 months ago

Hello doc, I just need to fix an issue with the 2019 dataset. After cleaning, I found that gender is suddenly missing for all of the 2019 records. I need to redo the first cleaning step, input the 2019 dataset, and apply the IF statement to ensure that all data in my cleanest dataset is 100% accurate. Excel sometimes has a hard time, though. I think we should do the data analysis in Python, because I did the cleaning, organizing, and renaming of titles in the query editor, and it would take forever in Excel itself. So after this I need to move the data somewhere else, and the only options are SQL or Python. If I use Excel again, it will take forever.

apolbernardo commented 6 months ago

If ever, the analysis process will take 2 weeks. March is coming, and I need to have a paper to slap in employers' faces, hahaha.