SATAY-LL / LaanLab-SATAY-DataAnalysis

This repository contains code and workflows for the data analysis of SATAY experiments.
Apache License 2.0

Interpretation of sequencing data obtained from NovoGene #30

Open EKingma opened 3 years ago

EKingma commented 3 years ago

We got back 150 bp read data from NovoGene for the SATAY experiments. Based on a manual inspection of the data, we see that:

Gregory94 commented 3 years ago

I tried two different tools for demultiplexing the data:

  1. sabre
  2. cutadapt

Both tools work, but both also discard a lot of reads. I am not sure yet whether this is because:

I need to look more closely at the data and the results to be sure what the cause is and how to solve it. The results of both tools are stored on the Linux machine at "~/Documents/satay/datasets/wt1_enzo_dataset" (they are still on the Linux computer because the data transfer between the Linux machine and the drives is somewhat troublesome, so I can't store them on the N-drive).

EKingma commented 3 years ago

Hi Greg,

I had a look at the files from the Sabre demultiplexing and I did not really understand why the barcodes file (wt1_enzo_barcodes.txt) has only 2 of the 4 barcodes.

Gregory94 commented 3 years ago

I followed the tutorial on the Sabre github page, and from what I understood you can only enter one barcode per sample. That is probably why it didn't get many of the pairs. Maybe I should run it a second time, now with the other (unused) barcodes on the file with all the unmatched reads.

Gregory94 commented 3 years ago

I am writing a small python code that loops over the sequencing reads and checks which barcode is present in each read. I notice that the barcodes of a read pair do not always refer to the same sample, and that some reads do not seem to contain a barcode at all.

I guess this is why the other tools discard many of the reads. The question is what to do with those reads?
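
For reference, a minimal sketch of what such a barcode check could look like is shown below (the file names are placeholders and the barcodes are the ones listed further down in this thread; this is an illustration, not the exact code I used):

```python
# Minimal sketch: scan a pair of FASTQ files and report which barcodes occur in each read.
# File names are placeholders; barcodes are the ones listed later in this thread.
barcodes = {
    "GCCACATA": "Sample1_for",
    "GCGAGTAA": "Sample1_rev",
    "GAGCTGAA": "Sample2_for",
    "GATAGACA": "Sample2_rev",
}

def read_fastq(path):
    """Yield (header, sequence) for every record in a FASTQ file."""
    with open(path) as f:
        while True:
            header = f.readline().rstrip()
            if not header:
                break
            seq = f.readline().rstrip()
            f.readline()  # '+' separator line
            f.readline()  # quality line
            yield header, seq

def find_barcodes(seq):
    """Return '<barcode> <sample>' for every known barcode that occurs in the sequence."""
    return [f"{bc} {name}" for bc, name in barcodes.items() if bc in seq]

# Walk the forward and reverse files in parallel and print what is found per read pair.
for (hdr1, seq1), (hdr2, seq2) in zip(read_fastq("reads_1.fq"), read_fastq("reads_2.fq")):
    print(hdr1)
    print(hdr2)
    print(find_barcodes(seq1))
    print(find_barcodes(seq2))
```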

This is a snippet of the output for the first 10 read pairs: for each pair, first the headers of the reads from the two files are shown (these should be almost identical), followed by the barcodes (if any) that were found in the two sequences:

@A00153:690:HJVN5DSXY:1:1101:1470:1000 1:N:0:GAACGTGA+AACTGGTG

@A00153:690:HJVN5DSXY:1:1101:1470:1000 2:N:0:GAACGTGA+AACTGGTG

GATAGACA Sample2_rev
GAGCTGAA Sample2_for

@A00153:690:HJVN5DSXY:1:1101:3568:1000 1:N:0:GAACGTGA+AACTGGTG

@A00153:690:HJVN5DSXY:1:1101:3568:1000 2:N:0:GAACGTGA+AACTGGTG

GCCACATA Sample1_for
GCGAGTAA Sample1_rev

@A00153:690:HJVN5DSXY:1:1101:4056:1000 1:N:0:GAACGTGA+AACTGGTG

@A00153:690:HJVN5DSXY:1:1101:4056:1000 2:N:0:GAACGTGA+AACTGGTG

GCCACATA Sample1_for
GCGAGTAA Sample1_rev

@A00153:690:HJVN5DSXY:1:1101:4273:1000 1:N:0:GAACGTGA+AACTGGTG

@A00153:690:HJVN5DSXY:1:1101:4273:1000 2:N:0:GAACGTGA+AACTGGTG

GCGAGTAA Sample1_rev
GCCACATA Sample1_for

@A00153:690:HJVN5DSXY:1:1101:5267:1000 1:N:0:GAACGTGA+AACTGGTG

@A00153:690:HJVN5DSXY:1:1101:5267:1000 2:N:0:GAACGTGA+AACTGGTG

GAGCTGAA Sample2_for
[]

@A00153:690:HJVN5DSXY:1:1101:6298:1000 1:N:0:GAACGTGA+AACTGGTG

@A00153:690:HJVN5DSXY:1:1101:6298:1000 2:N:0:GAACGTGA+AACTGGTG

GCCACATA Sample1_for
GCGAGTAA Sample1_rev

@A00153:690:HJVN5DSXY:1:1101:6388:1000 1:N:0:GAACGTGA+AACTGGTG

@A00153:690:HJVN5DSXY:1:1101:6388:1000 2:N:0:GAACGTGA+AACTGGTG

GCGAGTAA Sample1_rev
GCCACATA Sample1_for

@A00153:690:HJVN5DSXY:1:1101:7256:1000 1:N:0:GAACGTGA+AACTGGTG

@A00153:690:HJVN5DSXY:1:1101:7256:1000 2:N:0:GAACGTGA+AACTGGTG

GAGCTGAA Sample2_for
GATAGACA Sample2_rev

@A00153:690:HJVN5DSXY:1:1101:8558:1000 1:N:0:GAACGTGA+AACTGGTG

@A00153:690:HJVN5DSXY:1:1101:8558:1000 2:N:0:GAACGTGA+AACTGGTG

GCGAGTAA Sample1_rev
GCCACATA Sample1_for

@A00153:690:HJVN5DSXY:1:1101:9028:1000 1:N:0:GAACGTGA+AACTGGTG

@A00153:690:HJVN5DSXY:1:1101:9028:1000 2:N:0:GAACGTGA+AACTGGTG

GCGAGTAA Sample1_rev
GAGCTGAA Sample2_for
Gregory94 commented 3 years ago

For clarification of the above problem: there are two raw datasets, one with the forward reads and one with the reverse reads, and both contain 2 samples. These samples need to be separated. Every read has a header with a unique number so you can identify which reads form a pair (e.g. in the snippet in the previous message, the first two headers both contain the unique number 1470, meaning that they belong together, so that one read is the forward read and the other is the reverse read). Remember that there are 2 samples in these datasets that we need to separate based on a barcode sequence (an 8 bp sequence in our case). This barcode should be present in the sequence of the read and should refer to the same sample for the two reads that form a pair. For example, if I find that a forward read contains a barcode for sample 1, then I expect that its paired read (the reverse read) also contains a barcode indicating that it belongs to sample 1. But this is not always what I find (e.g. the last read pair in the snippet of the previous message). Moreover, some reads don't seem to contain any barcode at all (e.g. the read pair with the number 5267 in its headers). I think this is also why the software tools discard so many of our reads: they don't know what to do in these cases. See the attached figure (Demultiplexing_issue.pdf) for a schematic overview of the issue.

These are the barcode sequences:

  • sample 1 forward read: GCCACATA
  • sample 1 reverse read: GCGAGTAA
  • sample 2 forward read: GAGCTGAA
  • sample 2 reverse read: GATAGACA
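
As an illustration of the decision that has to be made for each pair, a classification rule could look like the sketch below (this only illustrates the problem, it is not what any of the demultiplexing tools actually do):

```python
# Sketch: given the barcodes found in the forward and reverse read of a pair,
# decide which sample the pair belongs to. Barcode-to-sample mapping as listed above.
SAMPLE_OF = {
    "GCCACATA": "sample1", "GCGAGTAA": "sample1",   # sample 1 forward / reverse
    "GAGCTGAA": "sample2", "GATAGACA": "sample2",   # sample 2 forward / reverse
}

def classify_pair(barcodes_fwd, barcodes_rev):
    """Return the sample name if the mates agree, otherwise 'unknown' or 'ambiguous'."""
    samples = {SAMPLE_OF[bc] for bc in barcodes_fwd + barcodes_rev}
    if len(samples) == 1:
        return samples.pop()    # all found barcodes point to the same sample
    if len(samples) == 0:
        return "unknown"        # no barcode found in either mate
    return "ambiguous"          # the mates point to different samples

print(classify_pair(["GCCACATA"], ["GCGAGTAA"]))   # sample1
print(classify_pair(["GAGCTGAA"], []))             # sample2 (only one mate has a barcode)
print(classify_pair(["GCCACATA"], ["GATAGACA"]))   # ambiguous
```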

EKingma commented 3 years ago

Hi Greg,

Great that you could make a working Python code for this. Did you process the entire file? If so, could you please tell me where the results are stored (or copy the results to the N drive)? Then I can have a look at it. Would it be possible to separate out the cases where one of the barcodes is missing from those that seem to have mixed barcodes?

Gregory94 commented 3 years ago

I have not yet processed the entire file, only the first few hundred reads or so. I am trying to implement the demultiplexing myself, because it can't be that hard to do. It is just a matter of deciding what to do with the cases where the barcodes are dubious.

Gregory94 commented 3 years ago

Here is an update on the progress of demultiplexing the dataset. I've written a Python code that can demultiplex the dataset, which initially seemed to work just fine. But apparently it doesn't like big data files, so it crashed the computer when running it on the actual dataset. To solve this, I translated the code to Bash, which has dedicated tools for handling text files and is therefore much more efficient. This also works, and with this code I'm able to demultiplex the files without any trouble (it takes somewhere between 0.5 and 1 hour). I now have four files:

  1. sample 1 forward reads,
  2. sample 1 reverse reads,
  3. sample 2 forward reads,
  4. sample 2 reverse reads.

However, the problem I now have is that the files are in the wrong order. Normally the read files for a sample are ordered the same way (i.e. read_1 in the forward file pairs with read_1 in the reverse file, read_2 in the forward file with read_2 in the reverse file, etc.). This is not the case in the demultiplexed files, and therefore I am afraid that the alignment software is going to pair the wrong reads with each other, making the alignment inaccurate. There is a tool (belonging to the same software package that I use for sequence trimming) that claims it can solve this issue, but it also doesn't handle such large files well, so it crashes.

What I now want to try is not to create four demultiplexed read files (what I now have), but two files where the forward and reverse reads are interleaved. So you end up with the following files:

  1. sample 1 forward & reverse reads,
  2. sample 2 forward & reverse reads.

This would then also have the wrong order initially, but that is potentially easier to solve.

EKingma commented 3 years ago

Hi Greg, do these files that you have now still have the headers connected to the reads? In that case it is fine and you do not need to worry about the order of the files.

Gregory94 commented 3 years ago

I now have two demultiplexed files, each containing both the forward and reverse reads of a single sample. I want to check whether the demultiplexing is good, but I haven't had the opportunity to check the data yet: the remote control of the Linux computer is not flawless and I have a hard time connecting to it. But there should be two demultiplexed files in the 'dataset' folder on the Linux machine.
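
As a quick sanity check (a minimal sketch with a placeholder file name, not the actual check I will run), one could verify that an interleaved file stores the reads of each pair consecutively by comparing the coordinate part of consecutive headers:

```python
# Sketch: verify that an interleaved FASTQ file stores read pairs consecutively.
# The two headers of a pair should be identical up to the trailing '1:N:...'/'2:N:...' field.
# The file name below is a placeholder.
def fastq_headers(path):
    """Yield the header line of every record in a FASTQ file."""
    with open(path) as f:
        for i, line in enumerate(f):
            if i % 4 == 0:              # every FASTQ record is 4 lines; line 0 is the header
                yield line.rstrip()

headers = fastq_headers("sample1_interleaved.fq")
bad_pairs = 0
for fwd, rev in zip(headers, headers):      # same iterator twice: consume two headers at a time
    if fwd.split()[0] != rev.split()[0]:    # compare the instrument:run:...:x:y coordinates
        bad_pairs += 1
print("read pairs with mismatching headers:", bad_pairs)
```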

Gregory94 commented 3 years ago

I have posted a question on Biostars regarding the demultiplexing of the dataset. Is it possible to change the way the barcodes are integrated in the sequences for future experiments? If so, maybe we should reconsider the barcode location.

Gregory94 commented 3 years ago

I have found a potential solution for the demultiplexing problem. This solution searches for barcodes in the reads and stores both the forward and reverse reads in the same file (i.e. interleaved paired-end fastq file). It uses the following steps:

  1. `grep --no-group-separator -h -A 2 -B 1 -E 'GCCACATA|GCGAGTAA' $file1 $file2 > $outputfile`
     This searches for all lines in both fastq files that contain either of the two barcodes (in this example those of sample 1). When a line contains one of the barcodes, it stores that line (the read sequence), the previous line (the header, `-B 1`) and the next two lines (the '+' separator and the quality string, `-A 2`) in an output file. The reads from both input files are stored in a single output file, so the output file is interleaved.

  2. `cat $outputfile | paste - - - - | sort -k1,1 -S 1G | tr '\t' '\n' > $outputfilesorted`
     The disadvantage of the first step is that the read pairs are not ordered. All downstream processes expect read pairs to be stored together like this:

     • readpair 1, forward read
     • readpair 1, reverse read
     • readpair 2, forward read
     • readpair 2, reverse read
     • etc.

     To accomplish this, the reads are sorted on their headers so that the reads that form a pair end up next to each other. The result is stored in another output file, which should have the same size as the output file from the first step.

  3. `bash bbsplitpairs.sh -Xmx1g in=$outputfilesorted out=$outputpair outs=$outputsing fint`
     After step 2 the fastq files are correctly formatted, but they still contain singletons: reads whose paired partner either has a barcode that belongs to another sample or does not contain any barcode at all. Those singleton reads have to be removed for the downstream processes to work correctly. This can be done with a tool from the bbmap package. This package is later used for trimming the reads with bbduk.sh, but it also contains bbsplitpairs.sh, which sorts out the reads that do not have a paired partner in the same sample and saves those reads separately in the file given by `outs`.

These steps are repeated with the barcodes from sample 2.

When this is run for the wt dataset, it finds 53756670 paired reads for sample 1 (88.9% of the total number of reads that contain a barcode from sample 1) and 59550160 paired reads for sample 2 (91.2% of the total number of reads that contain a barcode from sample 2). In total there are 134106512 reads in the raw files and 113306830 paired reads in the output files for sample 1 and sample 2 combined, which is 84.49% of the reads present in the raw data file.

Gregory94 commented 3 years ago

Here I present a more thorough overview of some of the steps I have been taking in the processing pipeline for the SATAY data. This is to clarify some of the things I have been doing and some of the issues I am facing. I take the following processing steps:

  1. Demultiplexing For this I have created a custom workflow specific for this dataset. This step will likely be different for other datasets, depending on how the data is formatted. The commands for the current dataset are discussed in the previous comment. This creates interleaved paired-end fastq files for all samples, based on the barcodes present in the sequences, and this seems to work fine. I have written a python code to check the format of the fastq files: all reads are stored as a pair (i.e. there are no singleton reads), the right barcodes are present in each read and the general format of the files is good. I also use fastqc, which checks the quality of the data and reports sequences that occur frequently (e.g. primers and barcodes). This also shows that the demultiplexing is good, as the barcodes from sample x do not occur frequently in sample y and vice versa.

  2. Trimming For this I use bbduk. Choosing the right settings seems to be quite important, as this influences the quality of the alignment. Currently I input the whole 50bp primer sequences (including the barcodes) and those are trimmed using kmers. The kmers should not be too short, as short kmers have low specificity and chances are that false positives occur. Typically I use somewhere around k=20, only look for primers and adapters in the first 50bp of the sequences, and trim the reads to the left when a match is found, since the primer sequences occur at the start of the reads. The pairs are trimmed together, so that when a primer sequence is found in one of the reads of a pair the reads still have the same length after adapter trimming (which is advised according to the bbduk documentation). Typically not many reads are lost (<0.5%). Here is an example for sample 2 from Enzo's dataset:

    
    Input:                      59550106 reads      8932515900 bases.
    QTrimmed:                   2 reads (0.00%)     20 bases (0.00%)
    KTrimmed:                   59545334 reads (99.99%)     2919778017 bases (32.69%)
    Trimmed by overlap:         347645 reads (0.58%)    7519706 bases (0.08%)
    Total Removed:              0 reads (0.00%)     2927297743 bases (32.77%)
    Result:                     59550106 reads (100.00%)    6005218157 bases (67.23%)

    Time:                       416.036 seconds.
    Reads Processed:            59550k      143.14k reads/sec
    Bases Processed:            8932m       21.47m bases/sec

What the best trimming steps are really depends on the data. The current steps might be improved, but I am not sure exactly how; possibly with some trial-and-error I will find out which settings give the best data quality after trimming.

3. Alignment
Alignment is done with [bwa mem](http://bio-bwa.sourceforge.net/bwa.shtml). When running this in paired-end mode for the current dataset this gives some warnings.

[M::mem_pestat] analyzing insert size distribution for orientation FR...
[M::mem_pestat] (25, 50, 75) percentile: (160, 286, 286)
[M::mem_pestat] low and high boundaries for computing mean and std.dev: (1, 538)
[M::mem_pestat] mean and std.dev: (237.82, 93.12)
[M::mem_pestat] low and high boundaries for proper pairs: (1, 664)
[M::mem_pestat] skip orientation RF as there are not enough pairs
[M::mem_pestat] skip orientation RR as there are not enough pairs
[M::mem_pestat] skip orientation FF as there are not enough pairs

I am not yet entirely sure what this means and whether it is something to be concerned about or not. The output file of the alignment (the sam file) can be checked using [samtools flagstat](http://www.htslib.org/doc/samtools-flagstat.html).
This is the flagstat output for sample 2:

59692730 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
142624 + 0 supplementary
0 + 0 duplicates
32325475 + 0 mapped (54.15% : N/A)
59550106 + 0 paired in sequencing
29775053 + 0 read1
29775053 + 0 read2
17298410 + 0 properly paired (29.05% : N/A)
22173236 + 0 with itself and mate mapped
10009615 + 0 singletons (16.81% : N/A)
3923564 + 0 with mate mapped to a different chr
3494300 + 0 with mate mapped to a different chr (mapQ>=5)


This shows that only about 54% of the reads are actually mapped. That is not much compared to the number of reads that were initially sequenced, but the absolute number of mapped reads is comparable to the number of mapped reads in the dataset from Benoit. Maybe more worrying is that only 29% of the reads are properly paired. This is also visible in [IGV](https://software.broadinstitute.org/software/igv/), where many reads are colored (see figure below).
[This indicates that the paired partner read is mapped to another chromosome](https://software.broadinstitute.org/software/igv/interpreting_insert_size).
I don't know the cause of this, but it probably has to do with one of the preceding processing steps.
![IGV_sample2_snippet](https://user-images.githubusercontent.com/29129193/102622665-a3b28f00-4141-11eb-8e2b-3ef3e63dae25.PNG)

4. Transposon mapping
For transposon mapping I still use the [python code](https://github.com/Gregory94/LaanLab-SATAY-DataAnalysis/blob/master/Python_TransposonMapping/transposonmapping_satay.py) that was created based on the Matlab code from Benoit. One issue that occurred in the code has to do with the flags assigned to each read during alignment. These flags tell something about how the reads were mapped (see [this website for an overview of all possible flags](https://broadinstitute.github.io/picard/explain-flags.html)). In the code from Benoit only two different flags are handled: 0 for the forward reads and 16 for the reverse reads (e.g. see lines 188 and 189 in the current code). But for paired-end reads the flags are different. These different flags still need to be included in the python code, but I am thinking about how to do this in a somewhat more robust way so that the code will also work with future datasets where the flags might again differ from the current dataset.
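
As an illustration of what a more robust flag check could look like, the sketch below decodes the flag bitwise instead of comparing it to the literal values 0 and 16 (the bit meanings are taken from the SAM specification; this is just a sketch, not the final implementation):

```python
# Sketch: decode a SAM flag bitwise instead of comparing to the literal values 0 and 16.
# Bit definitions follow the SAM specification.
FLAG_BITS = {
    1:    "read paired",
    2:    "read mapped in proper pair",
    4:    "read unmapped",
    8:    "mate unmapped",
    16:   "read reverse strand",
    32:   "mate reverse strand",
    64:   "first in pair",
    128:  "second in pair",
    256:  "not primary alignment",
    512:  "read fails platform/vendor quality checks",
    1024: "read is PCR or optical duplicate",
    2048: "supplementary alignment",
}

def decode_flag(flag):
    """Return the list of properties encoded in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

def is_reverse(flag):
    """True if the read mapped to the reverse strand, regardless of the pairing bits."""
    return bool(flag & 16)

# Example: flag 83 = 1 + 2 + 16 + 64 (paired, proper pair, reverse strand, first in pair).
print(decode_flag(83))
print(is_reverse(83), is_reverse(0))
```
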
Gregory94 commented 3 years ago

This is a follow-up on point 4 from the previous comment about the flag handling in the transposon mapping python code. This issue has now been resolved and the python code can handle flags from both paired-end and single-end datasets. For this I have created samflag.py, which converts a flag (i.e. a decimal number) to a 12-bit binary number. This binary number indicates which properties hold for a specific read; for more information see this website. Samflag is a python function that can be used by other scripts as well.

I noticed that after the alignment step there are reads that are either unmapped (likely because the read couldn't be fitted to any location on the reference genome (S288C)) or that have two mapping locations. The unmapped reads are ignored by the transposonmapping software, and for the reads that have multiple alignments only the alignment that was preferred by the alignment software (e.g. because it has a slightly higher mapping score) is used; the other alignment (called the Secondary Alignment (SA)) is ignored as well. This filtering is based on the sam flags.

So far I have analyzed dataset 1 from Enzo (dBem1dBem2dBem3dNrp1 strain). (The results are stored at N:\tnw\BN\LL\Shared\Gregory\datasets\wt1_enzo_dataset\wt1_enzo_dataset_demultiplexed_interleaved_sample1\wt1_enzo_dataset_demultiplexed_interleaved_sample1_trim2\align_out.) When I compare the results from this dataset with those from Benoit (e.g. using this code that shows the read distribution per chromosome), the read density is more than 2 times higher in the Enzo dataset than in the Benoit dataset, even though many reads are discarded in the preceding processing steps of the Enzo dataset (see my previous comment). I also see that the insertion density is higher around the centromeres and in the Ade2 gene, just like in the Benoit dataset.

However, I also see that there are still reads in annotated essential genes (of course this isn't a wild type strain, which might affect the essentiality of genes), and I see many insertions in Bem2, which is not what we expect. (There are only few to no insertions in Bem1, Bem3 and Nrp1, which is what we expect; the few insertions those genes do have might be due to alignment uncertainty and noise.) Below I have included the read distribution for chromosome 3, with the Benoit dataset (wt1) on the left and Enzo dataset 1 on the right. Note the difference in y-scale (this is created using this code).

datasetcompare_benoit_enzo_chrIII
Gregory94 commented 3 years ago

I have reanalyzed sample 2 (wild type strain) from Enzo's dataset. This time I have treated the data as single-end reads. For this I took the reads that contain the sequencing primer (TTTTACCGACCGTTACCGACCGTTTTCATCCCTA), which should correspond to the tn-insertion site. The discarded reads contain the NlaIII or the DpnII ligation sequence and correspond to the mates of the reads containing the sequencing primer. With this approach I was left with about 37% of the reads (i.e. 24063554 reads) compared with the number of reads that were originally sequenced for sample 2 (that might appear to be little, but remember we lose at least half of all reads since we discard the mate reads). Those reads were trimmed for the primer and adapter sequences and aligned to the reference genome as single-end reads. This allowed about 15265731 reads to be aligned, and those can be used for analyzing the transposon insertions. The results are stored at "X:\tnw\BN\LL\Shared\Gregory\datasets\wt1_enzo_dataset\wt1_enzo_dataset_demultiplexed_interleaved_sample2\wt1_enzo_dataset_demultiplexed_singleend_sample2_trim1"

When I do some checking of the reads I see that most of the annotated essential genes have fewer transposons compared to other regions, but this is not always very clear (some annotated essential genes still have relatively many reads). Also, our data appears to be noisier and has less 'contrast' compared to the Benoit dataset. By this I mean that on average we appear to have fewer reads per insertion location, while there are some locations with exceptionally many reads. This might make it even more tricky to estimate the relative fitness of gene inhibitions. Below are two figures of chromosome IX, where the second graph is zoomed in slightly compared to the first. On the left is Benoit's dataset and on the right is Enzo's dataset.

chrIX_1 chrIX_2

Maybe other trimming and alignment settings could improve the contrast a bit, but at least there is some improvement compared with the previous methods I tried that used paired-end mapping, where I didn't see much difference in the number of reads in essential genes compared to non-essential genes. I will check the sample 1 dataset as well to see whether there are indeed no insertions in the deleted genes.
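
A minimal sketch of how such a check could look is shown below (the gene coordinates and insertion positions are placeholders, not the real annotation or data):

```python
# Sketch: count transposon insertions inside a set of genes, e.g. to check whether
# the deleted genes (Bem1, Bem2, Bem3, Nrp1) are indeed free of insertions.
# Gene coordinates and insertion positions below are placeholders, not real data.
genes = {
    "BEM1": ("chrII", 100000, 102000),
    "NRP1": ("chrIV", 500000, 502200),
}
insertions = {  # chromosome -> insertion positions found by the transposon mapping
    "chrII": [95000, 100500, 150000],
    "chrIV": [499000, 510000],
}

for gene, (chrom, start, end) in genes.items():
    count = sum(start <= pos <= end for pos in insertions.get(chrom, []))
    print(f"{gene}: {count} insertion(s)")
```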

leilaicruz commented 3 years ago

That is great to know :) Can you also plot the statistics of the library to compare it more thoroughly with the Benoit dataset, like the ones in the table below?

> With this I mean that on average we appear to have less reads per insertion location and there are locations that have exceptionally many reads

Why would this be a problem? It could be a real feature of the data... That is why it is good to know the resolution of the transposition: how many transposons on average are inserted per bp? For Benoit it was approximately one insertion every 160 bp, right?
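
As an example of how that resolution could be computed (a minimal sketch with made-up positions, just to illustrate the calculation):

```python
# Sketch: estimate the insertion resolution from the genomic positions of the mapped
# transposon insertions by looking at the distance between neighbouring insertion sites.
# The positions below are made up for illustration.
insertion_positions = [120, 430, 1010, 1180, 2050]

positions = sorted(insertion_positions)
gaps = [b - a for a, b in zip(positions, positions[1:])]
print("mean distance between insertions:  ", sum(gaps) / len(gaps))
print("median distance between insertions:", sorted(gaps)[len(gaps) // 2])
```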

EKingma commented 3 years ago

Hi Greg, nice that it worked, but the uneven distribution of reads throughout the genome is indeed troublesome. Do you maybe know why 10 million of the reads that were left after taking only those that contain the sequencing primer from Benoit could not be aligned? And how many mismatches did you allow in the matching of the sequencing primer?

It would be good to have the number of transposons that were mapped such that we can see if the lower amount of contrast between essential/non-essential genes is due to a difference in the number of transposon insertions or due to the amount of reads we have. What it looks like now is that we might have filtered for locations that survive this process of trimming and alignment, which could cause the peaks in your alignment plots.

Gregory94 commented 3 years ago

I have tried to determine statistics similar to the ones Benoit shows in his paper, but I fail to reproduce the values presented there. Probably I am doing something wrong here, but nevertheless here are the numbers as I have them so far for Enzo's dataset. I'll update the numbers as soon as I have figured out what's wrong.

statistics_insertions_and_reads

@EKingma

> Do you maybe know why 10 million of the reads that were left after taking only those that contain the sequencing primer from Benoit could not be aligned?

I don't know, I haven't stored the rejected reads. I should recreate the dataset for this with the rejected reads.

More interestingly, I have also plotted the number of transposons instead of the number of reads. The transposon count is less noisy than the read count. This shows some similarities with Benoit's dataset (see figure below; note the difference in y-scales). It clearly shows the transposition bias towards the centromeres and the Ade2 gene. There is also a large peak on chromosome 12 around bp 460000. I looked at this location on sgd, but I couldn't find anything that would explain such a large peak: there are mainly non-coding regions, ribosomal RNA sequences and two small genes, which I don't expect to cause such a transposition bias. A high density of reads is also visible at this location in the read counts.

tn_count_genome_comparisson_20210204

Note: I used these datasets for creating the above plot: