Open chriskuz opened 1 week ago
Nulls: `totalTravelDistance` (56,436 elements), `segmentsEquipmentDescription` (52,443 elements), `segmentsDistance` (19,469 elements)

Error labeling `segmentsEquipmentDescription`:
`segmentsEquipmentDescription` is a column that shows what kind of aircraft was flown for a route. It has several labeling problems (inconsistent use of the `||` delimiter symbol; a lack of information for leg routes; inconsistency with similar aircraft naming; and blatant fake aircraft generation). A more consistent column that correctly uses the `||` delimiter could help us fill in the gaps on any missing information here. However, that would likely require careful use of regex, which is really annoying.
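A rough sketch of what the delimiter cleanup could look like, using only the standard library. Everything here is an assumption for illustration: the helper names (`split_segments`, `normalize_aircraft`, `align_equipment`) are hypothetical, the sample aircraft strings are made up, and padding missing legs at the end is a guess about where the gap is, not something verified against the data.

```python
import re

DELIM = "||"

def split_segments(value):
    """Split a '||'-delimited cell into one entry per leg.

    None or an empty string is treated as a fully missing cell.
    Empty entries (e.g. 'Airbus A320||') become None so gaps stay visible.
    """
    if value is None or value == "":
        return []
    return [part.strip() or None for part in value.split(DELIM)]

def normalize_aircraft(name):
    """Collapse whitespace and case so near-duplicate names compare equal."""
    if name is None:
        return None
    return re.sub(r"\s+", " ", name).strip().upper()

def align_equipment(equipment, airline_codes):
    """Pad the equipment list so it has one slot per leg in the helper column.

    Assumption: the airline-code cell carries the correct number of legs,
    and any missing equipment entries belong to the trailing legs.
    """
    legs = split_segments(airline_codes)
    equip = [normalize_aircraft(e) for e in split_segments(equipment)]
    if len(equip) < len(legs):
        equip += [None] * (len(legs) - len(equip))
    return equip
```

For example, `align_equipment("airbus  a320", "B6||B6")` yields one normalized name plus one explicit `None` for the leg with no aircraft listed, which makes the gaps countable per leg instead of per row. Fake or misspelled aircraft names would still need a lookup against a known list, which this sketch does not attempt.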
Duplicated Features (`segmentsAirlineCode`, `segmentsAirlineName`):
The `segmentsAirlineCode` and `segmentsAirlineName` columns are clean and consistent enough to serve as helper columns for the cleanup of `segmentsEquipmentDescription`. `segmentsAirlineName` holds the airline's full name and `segmentsAirlineCode` holds the abbreviated, prefixed callsign of the airline.

It is likely best for us to remove the impure routes (those that are not entirely JetBlue), which will shorten the table a little. Also, since this leaves only pure JetBlue routes, we can consider removing the airline name column as well as the airline code column entirely. The filter must happen first, based on these helper columns, before we consider removing them. There might be a case for keeping at least one of these columns to take advantage of the delimiters, which may help a model understand how price correlates with multi-leg routes. No resolution found for now.
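A minimal sketch of the filter-then-drop order described above, with rows represented as plain dicts. The code value `"B6"` is JetBlue's IATA code, but whether the dataset labels JetBlue that way is an assumption, and `is_pure_route`, `leg_count`, and `filter_then_drop` are hypothetical helper names.

```python
JETBLUE_CODE = "B6"  # assumption: the dataset labels JetBlue legs with this code
HELPER_COLUMNS = ("segmentsAirlineCode", "segmentsAirlineName")

def is_pure_route(airline_codes, code=JETBLUE_CODE):
    """True when every '||'-separated leg carries the given airline code."""
    if not airline_codes:
        return False
    return all(leg.strip() == code for leg in airline_codes.split("||"))

def leg_count(airline_codes):
    """Number of legs implied by the '||' delimiters (0 for a missing cell)."""
    if not airline_codes:
        return 0
    return len(airline_codes.split("||"))

def filter_then_drop(rows, drop_columns=HELPER_COLUMNS):
    """Filter on the helper column first, then remove the redundant columns."""
    pure = [row for row in rows if is_pure_route(row.get("segmentsAirlineCode"))]
    return [
        {key: value for key, value in row.items() if key not in drop_columns}
        for row in pure
    ]
```

If the columns do get dropped, something like `leg_count` computed beforehand is one possible way to keep the multi-leg signal without keeping the delimited strings. That is only an option to weigh, not a decision.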
I'll be adding highlights to this thread periodically to better track some of the work that went into understanding the data.