Some FASTQ files on AWS missing reads?

tdurham86 commented 4 years ago

Hi,

I have been trying to download the raw data for selected FACS-isolated cells. I downloaded a metadata spreadsheet containing the FASTQ locations from AWS here:

s3://czb-tabula-muris-senis/Metadata/tabula-muris-senis-facs-official-raw-objcell-metadatacleaned_ids__read1_read2.csv

I filtered for particular cell types of interest, and then additionally filtered for cells with high read/gene counts by cross-referencing the metadata with another spreadsheet that I downloaded from GEO (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM4505405):

GSM4505405_tabula-muris-senis-facs-official-raw-obj-metadata.csv

I joined the two spreadsheets on the index field (GEO spreadsheet)/obs_names field (AWS spreadsheet).

I noticed that at least some of the FASTQ files from AWS have far fewer reads than I expected based on the GEO metadata. For example, cell A10-D042044-3_9_M-1-1 has 4,111 genes detected and 11,915,818 counts according to the GEO metadata, but when I download the FASTQ files from the following S3 keys given in the AWS spreadsheet, the resulting files have fewer than 200 reads:

s3://czb-tabula-muris-senis/Plate_seq/3_month/170907_A00111_0052_AH2HTCDMXX/fastqs/A10-D042044-3_9_M-1-1_R1_001.fastq.gz
s3://czb-tabula-muris-senis/Plate_seq/3_month/170907_A00111_0052_AH2HTCDMXX/fastqs/A10-D042044-3_9_M-1-1_R2_001.fastq.gz

I have double-checked that this is not just a corrupted download. Each of the two FASTQ files in the AWS S3 bucket is quite small -- about 28 KB -- and the other FASTQs in that S3 directory are similarly small. Most FASTQ files in the other directories are in the tens of MB, but these are in the tens of KB. Was there an issue uploading some of the FASTQ files to S3? Or am I trying to download the wrong FASTQ files? Any help you could provide would be greatly appreciated! Thank you!
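For anyone who wants to reproduce the read count independently of the download tool, the records can be counted directly in the gzipped file. This is only a minimal sketch, assuming standard four-line FASTQ records; `count_fastq_reads` is a hypothetical helper name and is not part of any Tabula Muris Senis tooling.

```python
import gzip


def count_fastq_reads(path):
    """Count records in a gzipped FASTQ file, assuming 4 lines per record."""
    n_lines = 0
    # Text mode lets us iterate line by line without loading the whole file.
    with gzip.open(path, "rt") as handle:
        for _ in handle:
            n_lines += 1
    # A line count that is not a multiple of 4 suggests a malformed file.
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

Calling `count_fastq_reads("A10-D042044-3_9_M-1-1_R1_001.fastq.gz")` on a local copy returns the number of reads in that file, which can then be compared against the counts reported in the GEO metadata.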

tdurham86 commented 4 years ago

I just wanted to ping this issue again. Can anyone help? Did I direct my inquiry to the right place?

Thanks!

aopisco commented 4 years ago

@tdurham86, you are downloading the FASTQs from the right place. I'm not entirely sure about the metadata file you are using -- @olgabot, can you comment on this csv?

tdurham86 commented 4 years ago

Thanks @aopisco and @olgabot. It would be great to know if I'm using the correct csv metadata. I have encountered various inconsistencies in the cell and file naming conventions that make it difficult to automatically download data for all cells of interest. In particular, I found that some of the S3 keys listed in the tabula-muris-senis-facs-official-raw-objcell-metadatacleaned_ids__read1_read2.csv spreadsheet do not exist. For example, for cell H7_B001397 the csv file lists the S3 key as:

s3://czb-tabula-muris-senis/Plate_seq/24_month/180813_A00111_0188_AH7G2FDSXX__180831_A00111_0201_BH7WGCDSXX/H7_B001397_S43_L001_R1.fastq.gz

But as far as I can tell, the base name of the actual file in S3 does not include '_L001', and the file is found here:

s3://czb-tabula-muris-senis/Plate_seq/24_month/180813_A00111_0188_AH7G2FDSXX__180831_A00111_0201_BH7WGCDSXX/H7_B001397_S43_R1.fastq.gz
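A possible workaround when scripting downloads is to try both the listed key and a variant with the lane token removed. The sketch below is based only on the single example above: that the token always looks like `_L001` and sits directly before `_R1`/`_R2` is an assumption, and `candidate_keys` is a hypothetical helper name.

```python
import re

# Assumption: the lane token has the form "_L" + 3 digits and appears
# immediately before the read token, as in the H7_B001397 example.
_LANE_TOKEN = re.compile(r"_L\d{3}(?=_R[12]\.fastq\.gz$)")


def candidate_keys(listed_key):
    """Return the listed key, plus a lane-stripped variant if one exists."""
    stripped = _LANE_TOKEN.sub("", listed_key)
    if stripped == listed_key:
        return [listed_key]
    return [listed_key, stripped]
```

A download script could then check each candidate in order and use the first one that exists in the bucket. Keys without a lane token, such as the 3-month example earlier in this thread, are returned unchanged.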

Also, for the 3m and 18m cells, using the values in the 'obs_names' column of the AWS spreadsheet to look up rows in the 'index' column of the GEO spreadsheet works just fine, but the rows are named differently for the 24m cells, so this lookup does not work. Is there a combination of metadata fields that is globally unique and can be used to look up cells in either spreadsheet?
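To make the mismatch concrete, the identifiers in the two spreadsheets can be compared as sets, which shows exactly which cells fail to join. This is a rough sketch using only the standard library; it assumes the CSV headers are literally named `obs_names` and `index`, and `compare_ids` and `read_column` are hypothetical helper names.

```python
import csv


def read_column(path, column):
    """Collect the distinct values of one column from a CSV file."""
    with open(path, newline="") as handle:
        return {row[column] for row in csv.DictReader(handle)}


def compare_ids(aws_csv, geo_csv, aws_col="obs_names", geo_col="index"):
    """Split cell identifiers into matched and unmatched groups."""
    aws_ids = read_column(aws_csv, aws_col)
    geo_ids = read_column(geo_csv, geo_col)
    return {
        "matched": aws_ids & geo_ids,    # present in both spreadsheets
        "aws_only": aws_ids - geo_ids,   # no GEO row with this identifier
        "geo_only": geo_ids - aws_ids,   # no AWS row with this identifier
    }
```

Inspecting a few entries from `aws_only` and `geo_only` side by side should reveal how the naming differs for the cells that do not match.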

Thanks!

tdurham86 commented 4 years ago

Hi @aopisco and @olgabot , I just wanted to ping this issue again. Any updates on either the fastqs with few reads or whether I am using the correct metadata tables? Thanks.

czbiohub-sf / tabula-muris-senis

Some FASTQ files on AWS missing reads? #14